NFSv4                                                         S. Shepler
Internet-Draft                                                 M. Eisler
Intended status: Standards Track                               D. Noveck
Expires: December 3, 2007                                        Editors
                                                           June 11, 2007

                         NFSv4 Minor Version 1
                 draft-ietf-nfsv4-minorversion1-12.txt
Status of this Memo

   By submitting this Internet-Draft, each author represents that any
   applicable patent or other IPR claims of which he or she is aware
   have been or will be disclosed, and any of which he or she becomes
   aware will be disclosed, in accordance with Section 6 of BCP 79.
   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as Internet-
   Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in
   progress."
   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt.

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.
   This Internet-Draft will expire on December 3, 2007.

Copyright Notice

   Copyright (C) The IETF Trust (2007).
Abstract

   This Internet-Draft describes NFSv4 minor version one, including
   features retained from the base protocol and protocol extensions
   made subsequently.  The current draft includes description of the
   major
         2.9.1.   Required and Recommended Properties of Transports
         2.9.2.   Client and Server Transport Behavior
         2.9.3.   Ports
      2.10.  Session
         2.10.1.  Motivation and Overview
         2.10.2.  NFSv4 Integration
         2.10.3.  Channels
         2.10.4.  Trunking
         2.10.5.  Exactly Once Semantics
         2.10.6.  RDMA Considerations
         2.10.7.  Sessions Security
         2.10.8.  Session Mechanics - Steady State
         2.10.9.  Session Mechanics - Recovery
         2.10.10. Parallel NFS and Sessions
   3.  Protocol Data Types
      3.1.  Basic Data Types
      3.2.  Structured Data Types
   4.  Filehandles
      4.1.  Obtaining the First Filehandle
         4.1.1.  Root Filehandle
         4.1.2.  Public Filehandle
      4.2.  Filehandle Types
         4.2.1.  General Properties of a Filehandle
         5.13.2.  layout_alignment
         5.13.3.  layout_blksize
         5.13.4.  layout_hint
         5.13.5.  layout_type
         5.13.6.  mdsthreshold
      5.14.  Retention Attributes
   6.  Security Related Attributes
      6.1.  Goals
      6.2.  File Attributes Discussion
         6.2.1.  ACL Attributes
         6.2.2.  dacl and sacl Attributes
         6.2.3.  mode Attribute
         6.2.4.  mode_set_masked Attribute
      6.3.  Common Methods
         6.3.1.  Interpreting an ACL
         6.3.2.  Computing a Mode Attribute from an ACL
      6.4.  Requirements
         6.4.1.  Setting the mode and/or ACL Attributes
         6.4.2.  Retrieving the mode and/or ACL Attributes
         6.4.3.  Creating New Objects
   7.  Single-server Name Space
      7.1.  Server Exports
      7.2.  Browsing Exports
      7.3.  Server Pseudo File System
      7.4.  Multiple Roots
      7.5.  Filehandle Volatility
      7.6.  Exported Root
      7.7.  Mount Point Crossing
      7.8.  Security Policy and Name Space Presentation
   8.  File Locking and Share Reservations
      8.1.  Locking
         8.1.1.  Client and Session ID
         8.1.2.  State-owner Definition
         8.1.3.  Stateid Definition
         8.1.4.  Use of the Stateid and Locking
      8.2.  Lock Ranges
      8.3.  Upgrading and Downgrading Locks
      8.4.  Blocking Locks
      8.5.  Lease Renewal
      8.6.  Crash Recovery
         8.6.1.  Client Failure and Recovery
         8.6.2.  Server Failure and Recovery
         8.6.3.  Network Partitions and Recovery
      8.7.  Server Revocation of Locks
      8.8.  Share Reservations
      8.9.  OPEN/CLOSE Operations
      8.10.  Open Upgrade and Downgrade
      8.11.  Short and Long Leases
      8.12.  Clocks, Propagation Delay, and Calculating Lease
             Expiration
      8.13.  Vestigial Locking Infrastructure From V4.0
   9.  Client-Side Caching
      9.1.  Performance Challenges for Client-Side Caching
      9.2.  Delegation and Callbacks
         9.2.1.  Delegation Recovery
      9.3.  Data Caching
         9.3.1.  Data Caching and OPENs
         9.3.2.  Data Caching and File Locking
         9.3.3.  Data Caching and Mandatory File Locking
         9.3.4.  Data Caching and File Identity
      9.4.  Open Delegation
         9.4.1.  Open Delegation and Data Caching
         9.4.2.  Open Delegation and File Locks
         9.4.3.  Handling of CB_GETATTR
         9.4.4.  Recall of Open Delegation
         9.4.5.  Clients that Fail to Honor Delegation Recalls
         9.4.6.  Delegation Revocation
      9.5.  Data Caching and Revocation
         9.5.1.  Revocation Recovery for Write Open Delegation
      9.6.  Attribute Caching
      9.7.  Data and Metadata Caching and Memory Mapped Files
      9.8.  Name Caching
      9.9.  Directory Caching
   10.  Multi-Server Name Space
      10.1.  Location attributes
      10.2.  File System Presence or Absence
      10.3.  Getting Attributes for an Absent File System
         10.3.1.  GETATTR Within an Absent File System
         10.3.2.  READDIR and Absent File Systems
      10.4.  Uses of Location Information
         10.4.1.  File System Replication
         10.4.2.  File System Migration
         10.4.3.  Referrals
      10.5.  Additional Client-side Considerations
      10.6.  Effecting File System Transitions
         10.6.1.  File System Transitions and Simultaneous Access
         10.6.2.  Simultaneous Use and Transparent Transitions
         10.6.3.  Filehandles and File System Transitions
         10.6.4.  Fileid's and File System Transitions
         10.6.5.  Fsids and File System Transitions
         10.6.6.  The Change Attribute and File System Transitions
         10.6.7.  Lock State and File System Transitions
         10.6.8.  Write Verifiers and File System Transitions
      10.7.  Effecting File System Referrals
         10.7.1.  Referral Example (LOOKUP)
         10.7.2.  Referral Example (READDIR)
      10.8.  The Attribute fs_absent
      10.9.  The Attribute fs_locations
      10.10.  The Attribute fs_locations_info
         10.10.1.  The fs_locations_server4 Structure
         10.10.2.  The fs_locations_info4 Structure
         10.10.3.  The fs_locations_item4 Structure
      10.11.  The Attribute fs_status
   11.  Directory Delegations
      11.1.  Introduction to Directory Delegations
      11.2.  Directory Delegation Design
      11.3.  Attributes in Support of Directory Notifications
      11.4.  Delegation Recall
      11.5.  Directory Delegation Recovery
   12.  Parallel NFS (pNFS)
      12.1.  Introduction
      12.2.  pNFS Definitions
         12.2.1.  Metadata
         12.2.2.  Metadata Server
         12.2.3.  pNFS Client
         12.2.4.  Storage Device
         12.2.5.  Storage Protocol
         12.2.6.  Control Protocol
         12.2.7.  Layout Types
         12.2.8.  Layout
         12.2.9.  Layout Iomode
         12.2.10. Device IDs
      12.3.  pNFS Operations
      12.4.  pNFS Attributes
      12.5.  Layout Semantics
         12.5.1.  Guarantees Provided by Layouts
         12.5.2.  Getting a Layout
         12.5.3.  Committing a Layout
         12.5.4.  Recalling a Layout
         12.5.5.  Metadata Server Write Propagation
      12.6.  pNFS Mechanics
      12.7.  Recovery
         12.7.1.  Client Recovery
         12.7.2.  Dealing with Lease Expiration on the Client
         12.7.3.  Dealing with Loss of Layout State on the Metadata
                  Server
         12.7.4.  Recovery from Metadata Server Restart
         12.7.5.  Operations During Metadata Server Grace Period
         12.7.6.  Storage Device Recovery
      12.8.  Metadata and Storage Device Roles
      12.9.  Security Considerations
   13.  PNFS: NFSv4.1 File Layout Type
      13.1.  Client ID and Session Considerations
      13.2.  File Layout Definitions
      13.3.  File Layout Data Types
      13.4.  Interpreting the File Layout
      13.5.  Sparse and Dense Stripe Unit Packing
      13.6.  Data Server Multipathing
      13.7.  Operations Issued to NFSv4.1 Data Servers
      13.8.  COMMIT Through Metadata Server
      13.9.  The Layout Iomode
      13.10.  Metadata and Data Server State Coordination
         13.10.1.  Global Stateid Requirements
         13.10.2.  Data Server State Propagation
      13.11.  Data Server Component File Size
      13.12.  Recovery from Loss of Layout
      13.13.  Security Considerations for the File Layout Type
   14.  Internationalization
      14.1.  Stringprep profile for the utf8str_cs type
      14.2.  Stringprep profile for the utf8str_cis type
      14.3.  Stringprep profile for the utf8str_mixed type
      14.4.  UTF-8 Related Errors
   15.  Error Values
      15.1.  Error Definitions
      15.2.  Operations and their valid errors
      15.3.  Callback operations and their valid errors
      15.4.  Errors and the operations that use them
   16.  NFS version 4.1 Procedures
      16.1.  Procedure 0: NULL - No Operation
      16.2.  Procedure 1: COMPOUND - Compound Operations
   17.  NFS version 4.1 Operations
      17.1.  Operation 3: ACCESS - Check Access Rights
      17.2.  Operation 4: CLOSE - Close File
      17.3.  Operation 5: COMMIT - Commit Cached Data
      17.4.  Operation 6: CREATE - Create a Non-Regular File Object
      17.5.  Operation 7: DELEGPURGE - Purge Delegations Awaiting
             Recovery
      17.6.  Operation 8: DELEGRETURN - Return Delegation
      17.7.  Operation 9: GETATTR - Get Attributes
      17.8.  Operation 10: GETFH - Get Current Filehandle
      17.9.  Operation 11: LINK - Create Link to a File
      17.10.  Operation 12: LOCK - Create Lock
      17.11.  Operation 13: LOCKT - Test For Lock
      17.12.  Operation 14: LOCKU - Unlock File
      17.13.  Operation 15: LOOKUP - Lookup Filename
      17.14.  Operation 16: LOOKUPP - Lookup Parent Directory
      17.15.  Operation 17: NVERIFY - Verify Difference in Attributes
      17.16.  Operation 18: OPEN - Open a Regular File
      17.17.  Operation 19: OPENATTR - Open Named Attribute Directory
      17.18.  Operation 21: OPEN_DOWNGRADE - Reduce Open File Access
      17.19.  Operation 22: PUTFH - Set Current Filehandle
      17.20.  Operation 23: PUTPUBFH - Set Public Filehandle
      17.21.  Operation 24: PUTROOTFH - Set Root Filehandle
      17.22.  Operation 25: READ - Read from File
      17.23.  Operation 26: READDIR - Read Directory
      17.24.  Operation 27: READLINK - Read Symbolic Link
      17.25.  Operation 28: REMOVE - Remove File System Object
      17.26.  Operation 29: RENAME - Rename Directory Entry
      17.27.  Operation 31: RESTOREFH - Restore Saved Filehandle
      17.28.  Operation 32: SAVEFH - Save Current Filehandle
      17.29.  Operation 33: SECINFO - Obtain Available Security
      17.30.  Operation 34: SETATTR - Set Attributes
      17.31.  Operation 37: VERIFY - Verify Same Attributes
      17.32.  Operation 38: WRITE - Write to File
      17.33.  Operation 40: BACKCHANNEL_CTL - Backchannel control
      17.34.  Operation 41: BIND_CONN_TO_SESSION
      17.35.  Operation 42: EXCHANGE_ID - Instantiate Client ID
      17.36.  Operation 43: CREATE_SESSION - Create New Session and
              Confirm Client ID
      17.37.  Operation 44: DESTROY_SESSION - Destroy existing session
      17.38.  Operation 45: FREE_STATEID - Free stateid with no locks
      17.39.  Operation 46: GET_DIR_DELEGATION - Get a directory
              delegation
      17.40.  Operation 47: GETDEVICEINFO - Get Device Information
      17.41.  Operation 48: GETDEVICELIST
      17.42.  Operation 49: LAYOUTCOMMIT - Commit writes made using a
              layout
      17.43.  Operation 50: LAYOUTGET - Get Layout Information
      17.44.  Operation 51: LAYOUTRETURN - Release Layout Information
      17.45.  Operation 52: SECINFO_NO_NAME - Get Security on Unnamed
              Object
      17.46.  Operation 53: SEQUENCE - Supply per-procedure sequencing
              and control
      17.47.  Operation 54: SET_SSV
      17.48.  Operation 55: TEST_STATEID - Test stateids for validity
      17.49.  Operation 56: WANT_DELEGATION
      17.50.  Operation 57: DESTROY_CLIENTID - Destroy existing client
              ID
      17.51.  Operation 58: RECLAIM_COMPLETE - Indicates Reclaims
              Finished
      17.52.  Operation 10044: ILLEGAL - Illegal operation
   18.  NFS version 4.1 Callback Procedures
      18.1.  Procedure 0: CB_NULL - No Operation
      18.2.  Procedure 1: CB_COMPOUND - Compound Operations
   19.  NFS version 4.1 Callback Operations
      19.1.  Operation 3: CB_GETATTR - Get Attributes
      19.2.  Operation 4: CB_RECALL - Recall an Open Delegation
      19.3.  Operation 5: CB_LAYOUTRECALL
      19.4.  Operation 6: CB_NOTIFY - Notify directory changes
      19.5.  Operation 7: CB_PUSH_DELEG
      19.6.  Operation 8: CB_RECALL_ANY - Keep any N delegations
      19.7.  Operation 9: CB_RECALLABLE_OBJ_AVAIL
      19.8.  Operation 10: CB_RECALL_SLOT - change flow control limits
      19.9.  Operation 11: CB_SEQUENCE - Supply backchannel sequencing
             and control
      19.10.  Operation 12: CB_WANTS_CANCELLED
      19.11.  Operation 13: CB_NOTIFY_LOCK - Notify of possible lock
              availability
      19.12.  Operation 10044: CB_ILLEGAL - Illegal Callback Operation
   20.  Security Considerations
   21.  IANA Considerations
      21.1.  Defining new layout types
   22.  References
      22.1.  Normative References
      22.2.  Informative References
   Appendix A.  Acknowledgments
   Authors' Addresses
   Intellectual Property and Copyright Statements
1.  Introduction

1.1.  The NFSv4.1 Protocol

   The NFSv4.1 protocol is a minor version of the NFSv4 protocol
   described in [2].  It generally follows the guidelines for the minor
   versioning model laid out in Section 10 of RFC 3530.  However, it
   diverges from guidelines 11 ("a client and server that supports
   minor version X must support minor versions 0 through X-1"), and 12
   ("no
   context of a session.  Sessions provide a client context for every
   request and support robust reply protection for non-idempotent
   requests.
2.4.  Client Identifiers and Client Owners

   For each operation that obtains or depends on locking state, the
   specific client must be determinable by the server.  In NFSv4, each
   distinct client instance is represented by a client ID, which is a
   64-bit identifier that identifies a specific client at a given time
   and which is changed whenever the client re-initializes, and may
   change when the server re-initializes.  Client IDs are used to
   support lock identification and crash recovery.
   In NFSv4.1, during steady state operation, the client ID associated
   with each operation is derived from the session (see Section 2.10)
   on which the operation is issued.  Each session is associated with a
   specific client ID at session creation and that client ID then
   becomes the client ID associated with all requests issued using it.
   Therefore, unlike NFSv4.0, the only NFSv4.1 operations possible
   before a client ID is established are those needed to establish the
   client ID.
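
   The following Python sketch (illustrative only, not part of the
   protocol) shows the shape of the establishment sequence described
   in the next paragraph.  The send_exchange_id and
   send_create_session callables are hypothetical stand-ins for code
   that marshals and issues the corresponding operations; the returned
   values mirror the eir_clientid and eir_sequenceid results of
   EXCHANGE_ID.

      def establish_client_id(send_exchange_id, send_create_session,
                              co_ownerid, co_verifier):
          # EXCHANGE_ID yields a client ID (eir_clientid) together
          # with a sequence id (eir_sequenceid) to present when
          # confirming that client ID.
          clientid, sequenceid = send_exchange_id(co_ownerid,
                                                  co_verifier)

          # CREATE_SESSION using the returned client ID both confirms
          # it and creates the session over which all later requests
          # will be issued.
          session = send_create_session(clientid, sequenceid)
          return clientid, session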
   A sequence of an EXCHANGE_ID operation followed by a CREATE_SESSION
   operation using that client ID (eir_clientid as returned from
   EXCHANGE_ID) is required to establish the identification on the
   server.  Establishment of identification by a new incarnation of the
   client also has the effect of immediately releasing any locking
   state that a previous incarnation of that same client might have had
   on the server.  Such released state would include all lock, share
   reservation, layout state, and where the server is not supporting
   the CLAIM_DELEGATE_PREV claim type, all delegation state associated
   with
      *  A true random number.  However, since this number ought to be
         the same between client incarnations, this shares the same
         problem as that of using the timestamp of the software
         installation.

   o  For a user level NFS version 4 client, it should contain
      additional information to distinguish the client from other user
      level clients running on the same host, such as a process
      identifier or other unique sequence; one way of combining such
      pieces is sketched below.
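
   As an illustration only, the following Python sketch assembles a
   co_ownerid string from such pieces.  The particular choice of host
   name plus a per-installation identifier, and the "/" separator, are
   assumptions of the sketch rather than requirements of this
   specification.

      import os
      import socket

      def make_co_ownerid(install_id, user_level=False):
          # Pieces chosen to be the same between incarnations of the
          # same client: the host name plus an identifier fixed at
          # software installation (or configuration) time.
          parts = [socket.gethostname(), install_id]
          if user_level:
              # A user-level client adds a process identifier (or
              # other unique sequence) to distinguish itself from
              # other user-level clients on the same host.
              parts.append(str(os.getpid()))
          return "/".join(parts)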
   As a security measure, the server MUST NOT cancel a client's leased
   state if the principal that established the state for a given
   co_ownerid string is not the same as the principal issuing the
   EXCHANGE_ID.
   A server may compare a client_owner4 in an EXCHANGE_ID with an
   nfs_client_id4 established using SETCLIENTID using NFSv4 minor
   version 0, so that an NFSv4.1 client is not forced to delay until
   lease expiration for locking state established by the earlier client
   using minor version 0.  This requires the client_owner4 be
   constructed the same way as the nfs_client_id4.  If the latter's
   contents included the server's network address, and the NFSv4.1
   client does not wish to use a client ID that prevents trunking, it
   should issue two EXCHANGE_ID operations.  The first EXCHANGE_ID will
   have a client_owner4 equal to the nfs_client_id4.  This will clear
   layer to handle receives.  These buffers remain in use by the RPC/
   NFSv4.1 implementation; the size and number of them must be known to
   the remote peer in order to avoid RDMA errors which would cause a
   fatal error on the RDMA connection.
   NFSv4.1 manages slots as resources on a per session basis (see
   Section 2.10), while RDMA connections manage credits on a per
   connection basis.  This means that in order for a peer to send data
   over RDMA to a remote buffer, it has to have both an NFSv4.1 slot,
   and an RDMA credit.  If multiple RDMA connections are associated
   with a session, then if the total number of credits across all RDMA
   connections associated with the session is X, and the number of
   slots in the session is Y, then the maximum number of outstanding
   requests is the lesser of X and Y.
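
   The bound can be stated as a one-line computation.  The following
   Python sketch (illustrative only) assumes each outstanding request
   consumes exactly one session slot and one RDMA credit.

      def max_outstanding(session_slots, credits_per_connection):
          # Each outstanding request holds one session slot and one
          # RDMA credit, so concurrency is capped by whichever
          # resource is scarcer.
          total_credits = sum(credits_per_connection)
          return min(total_credits, session_slots)

      # Example: a session with 32 slots spread over two RDMA
      # connections carrying 10 and 16 credits allows at most
      # min(26, 32) = 26 outstanding requests.
      assert max_outstanding(32, [10, 16]) == 26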
2.10.6.2.  Flow Control

   Previous versions of NFS do not provide flow control; instead they
   rely on the windowing provided by transports like TCP to throttle
   requests.  This does not work with RDMA, which provides no operation
   flow control and will terminate a connection in error when limits
   are
2.10.6.3.  Padding

   Header padding is requested by each peer at session initiation (see
   the ca_headerpadsize argument to CREATE_SESSION in Section 17.36),
   and subsequently used by the RPC RDMA layer, as described in [9].
   Zero padding is permitted.
   Padding leverages the useful property that RDMA transfers preserve
   the alignment of data, even when the data are placed into anonymous
   (untagged) buffers.  If requested, client inline writes will insert
   appropriate pad octets within the request header to align the data
   payload on the specified boundary.  The client is encouraged to add
   sufficient padding (up to the negotiated size) so that the "data"
   field of the NFSv4.1 WRITE operation is aligned.  Most servers can
   make good use of such padding, which allows them to chain receive
   buffers in such a way that any data carried by client requests will
   be placed into appropriate buffers at the server, ready for file
   system processing.  The receiver's RPC layer encounters no overhead
   from skipping over pad octets, and the RDMA layer's high performance
   makes the insertion and transmission of padding on the sender a
   significant optimization.  In this way, the need for servers to
   perform RDMA Read to satisfy all but the largest client writes is
   obviated.  An added benefit is the reduction of message round trips
   on the network - a potentially good trade, where latency is present.
   The value to choose for padding is subject to a number of criteria.
   A primary source of variable-length data in the RPC header is the
   authentication information, the form of which is client-determined,
   possibly in response to server specification.  The contents of
   COMPOUNDs, sizes of strings such as those passed to RENAME, etc. all
   go into the determination of a maximal NFSv4.1 request size and
   therefore minimal buffer size.  The client must select its offered
   value carefully, so as not to overburden the server, and vice versa.
   The payoff of an appropriate padding value is higher performance.
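
   The pad computation itself is straightforward.  The following
   Python sketch (illustrative only) computes the number of pad octets
   needed to bring a header of a given length up to an alignment
   boundary; the 256-octet boundary in the example is an assumption
   for illustration, not a value taken from this specification.

      def pad_length(header_len, boundary):
          # Pad octets required after a header of header_len octets so
          # that the following data payload begins on the boundary.
          if boundary == 0:  # zero padding is permitted
              return 0
          return (boundary - header_len % boundary) % boundary

      # A 211-octet RPC/NFS header with a 256-octet boundary needs 45
      # pad octets so the WRITE "data" field begins at offset 256.
      assert pad_length(211, 256) == 45
      assert pad_length(256, 256) == 0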
   [[Comment.5: RFC editor please keep this diagram on one page.]]

   Sender gather:
       |RPC Request|Pad octets|Length| -> |User data...|
       \------+----------------------/ \
               \                        \
                \    Receiver scatter:   \-----------+- ...
           /-----+----------------\       \           \
           |RPC Request|Pad|Length| -> |FS buffer|->|FS buffer|->...
   In the above case, the server may recycle unused buffers to the next
   posted receive if unused by the actual received request, or may pass
   the now-complete buffers by reference for normal write processing.
   For a server which can make use of it, this removes any need for
   data copies of incoming data, without resorting to complicated
   end-to-end
      a legitimate session, or associate a rogue session with a
      legitimate client ID in order to maliciously alter the client
      ID's lock state via CLOSE, LOCKU, DELEGRETURN, LAYOUTRETURN, etc.
   o  In cases where the server's security policies on a portion of its
      namespace require RPCSEC_GSS authentication, a client may have to
      use an RPCSEC_GSS credential to remove per-file state (for
      example LOCKU, CLOSE, etc.).  The server may require that the
      principal that removes the state match certain criteria (for
      example, the principal might have to be the same as the one that
      acquired the state).  However, the client might not have an
      RPCSEC_GSS context for such a principal, and might not be able to
      create such a context (perhaps because the user has logged off).
      When the client establishes SP4_MACH_CRED or SP4_SSV protection,
      it can specify a list of operations that the server MUST allow
      using the machine credential (if SP4_MACH_CRED is used) or the
      SSV credential (if SP4_SSV is used); a sketch of the resulting
      server-side check appears below.
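
   As an illustration only, the following Python sketch models the
   server-side decision just described.  The operation names,
   principal strings, and spo_must_allow contents shown are
   hypothetical examples.

      def state_removal_allowed(op, principal, spo_must_allow,
                                machine_principal, state_principal):
          # Operations listed in spo_must_allow (e.g. CLOSE, LOCKU)
          # are honored under the machine (or SSV) credential even
          # when the issuing principal differs from the one that
          # created the state; otherwise the principals must match.
          if op in spo_must_allow and principal == machine_principal:
              return True
          return principal == state_principal

      # The client can CLOSE files on behalf of a user who has logged
      # off, using its machine credential:
      assert state_removal_allowed("CLOSE", "nfs@client.example",
                                   {"CLOSE", "LOCKU"},
                                   "nfs@client.example",
                                   "eve@example")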
   The SP4_MACH_CRED state protection option uses a machine credential
   where the principal that creates the client ID must also be the
   principal that performs client ID and session maintenance
   operations.  The security of the machine credential state protection
   approach depends entirely on safeguarding the per-machine
   credential.
skipping to change at page 62, line 18 skipping to change at page 62, line 7
with an SSV value that is zero.  For this reason, each time a new
principal uses a client ID for the first time, the client SHOULD
issue a SET_SSV with that principal's RPCSEC_GSS credentials, with
RPCSEC_GSS service set to RPC_GSS_SVC_PRIVACY.

Here are the types of attacks that can be attempted by an attacker
named Eve on a victim named Bob, and how SP4_SSV protection foils
each attack:
o  Suppose Eve is the first user to log into a legitimate client.
   Eve's use of an NFSv4.1 file system will cause an SSV to be
   created via the legitimate client's NFSv4.1 implementation.  The
   SET_SSV that creates the SSV will be protected by the RPCSEC_GSS
   context created by the legitimate client which uses Eve's GSS
   principal and credentials.  Eve can eavesdrop on the network while
   her RPCSEC_GSS context is created, and the SET_SSV using her
   context is issued.  Even if the legitimate client issues the
   SET_SSV with RPC_GSS_SVC_PRIVACY, because Eve knows her own
   credentials, she can decrypt the SSV.  Eve can compute an
   RPCSEC_GSS credential that BIND_CONN_TO_SESSION will accept, and
   so associate a new connection with the legitimate session.  Eve
   can change the slot id and sequence state of a legitimate session,
   and/or the SSV state, in such a way that when Bob accesses the
   server via the same legitimate client, the legitimate client will
   be unable to use the session.
   The client's only recourse is to create a new client ID for Bob to
   use, and establish a new SSV for the client ID.  The client will
   be unable to delete the old client ID, and will let the lease on
   the old client ID expire.
   Once the legitimate client establishes an SSV over the new session
   using Bob's RPCSEC_GSS context, Eve can use the new session via
   the legitimate client, but she cannot disrupt Bob.  Moreover,
   because the client SHOULD have modified the SSV due to Eve using

skipping to change at page 64, line 22
The SSV provides the secret key for a mechanism that NFSv4.1 uses for
state protection.  Contexts for this mechanism are not established
via the RPCSEC_GSS protocol.  Instead, the contexts are automatically
created when EXCHANGE_ID specifies SP4_SSV protection.  The only
tokens defined are the PerMsgToken (emitted by GSS_GetMIC) and the
SealedMessage (emitted by GSS_Wrap).
The mechanism OID for the SSV mechanism is:
iso.org.dod.internet.private.enterprise.Michael Eisler.nfs.ssv_mech
(1.3.6.1.4.1.28882.1.1).  While the SSV mechanism does not define
any initial context tokens, the OID can be used to let servers
indicate that the SSV mechanism is acceptable whenever the client
issues a SECINFO or SECINFO_NO_NAME operation (see Section 2.6).
The PerMsgToken description is based on an XDR definition:

    /* Input for computing smt_hmac */
    struct ssv_mic_plain_tkn4 {
        uint32_t        smpt_ssv_seq;
        opaque          smpt_orig_plain<>;
    };

    /* SSV GSS PerMsgToken token */
    struct ssv_mic_tkn4 {
        uint32_t        smt_ssv_seq;
        opaque          smt_hmac<>;
    };
The token emitted by GSS_GetMIC() is XDR encoded and of XDR data type
ssv_mic_tkn4.  The field smt_ssv_seq comes from the SSV sequence
number, which is equal to 1 after SET_SSV is called the first time on
a client ID.  Thereafter, it is incremented on each SET_SSV.  Thus
smt_ssv_seq represents the version of the SSV at the time
GSS_GetMIC() was called.  This allows the SSV to be changed without
serializing all RPC calls that use the SSV mechanism with SET_SSV
operations.
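The token construction just described can be sketched in a few lines
of Python.  This is a non-normative illustration: the XDR helpers are
hand-rolled, SHA-256 merely stands in for whatever one-way hash
algorithm EXCHANGE_ID actually negotiated, and the sketch assumes the
current SSV is used directly as the HMAC key.

    import hashlib, hmac, struct

    def xdr_uint32(n):
        return struct.pack(">I", n)

    def xdr_opaque(data):
        # XDR variable-length opaque: 4-octet length, data, zero-fill
        # to a 4-octet boundary.
        pad = (4 - len(data) % 4) % 4
        return struct.pack(">I", len(data)) + data + b"\x00" * pad

    def ssv_get_mic(ssv, ssv_seq, plaintext):
        # Encode ssv_mic_plain_tkn4 and HMAC it with the SSV as the
        # key, then emit the XDR-encoded ssv_mic_tkn4.
        plain_tkn = xdr_uint32(ssv_seq) + xdr_opaque(plaintext)
        smt_hmac = hmac.new(ssv, plain_tkn, hashlib.sha256).digest()
        return xdr_uint32(ssv_seq) + xdr_opaque(smt_hmac)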
The field smt_hmac is an HMAC ([12]), calculated by using the current

skipping to change at page 65, line 27
    struct ssv_seal_plain_tkn4 {
        opaque          sspt_confounder<>;
        uint32_t        sspt_ssv_seq;
        opaque          sspt_orig_plain<>;
        opaque          sspt_pad<>;
    };

    /* SSV GSS SealedMessage token */
    struct ssv_seal_cipher_tkn4 {
        uint32_t        ssct_ssv_seq;
        opaque          ssct_iv<>;
        opaque          ssct_encr_data<>;
        opaque          ssct_hmac<>;
    };
The token emitted by GSS_Wrap() is XDR encoded and of XDR data type
ssv_seal_cipher_tkn4.

The ssct_ssv_seq field has the same meaning as smt_ssv_seq.

The ssct_encr_data field is the result of encrypting a value of the
XDR encoded data type ssv_seal_plain_tkn4.  The encryption key is the
SSV, and the encryption algorithm is that negotiated by EXCHANGE_ID.

The ssct_iv field is the initialization vector (IV) for the
encryption algorithm (if applicable) and is sent in clear text.  The
content and size of the IV MUST comply with the specification of the
encryption algorithm.  For example, the id-aes256-CBC algorithm MUST
use a 16 octet initialization vector (IV), which MUST be
unpredictable for each instance of a value of type ssv_seal_plain_tkn4
that is encrypted with a particular SSV key.
The ssct_hmac field is the result of computing an HMAC using the
value of the XDR encoded data type ssv_seal_plain_tkn4 as the input
text.  The key is the SSV, and the one-way hash algorithm is that
negotiated by EXCHANGE_ID.

The sspt_confounder field is a random value.

The sspt_ssv_seq field is the same as ssct_ssv_seq.
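Putting the SealedMessage fields together, a GSS_Wrap() for this
mechanism might look like the sketch below.  It is non-normative and
assumes the negotiated algorithms are id-aes256-CBC (so the SSV is 32
octets) with an HMAC over SHA-256, uses the third-party Python
`cryptography` package, and picks a 16-octet confounder; none of
those choices are mandated by this section.

    import hashlib, hmac, os, struct
    from cryptography.hazmat.primitives.ciphers import (
        Cipher, algorithms, modes)

    def xdr_uint32(n):
        return struct.pack(">I", n)

    def xdr_opaque(data):
        pad = (4 - len(data) % 4) % 4
        return struct.pack(">I", len(data)) + data + b"\x00" * pad

    def ssv_wrap(ssv, ssv_seq, plaintext):
        # Build ssv_seal_plain_tkn4.  The sspt_pad field is sized so
        # the full XDR encoding is a whole number of 16-octet AES
        # blocks, as CBC requires.
        confounder = os.urandom(16)              # sspt_confounder
        body = (xdr_opaque(confounder) + xdr_uint32(ssv_seq) +
                xdr_opaque(plaintext))
        pad = (16 - (len(body) + 4) % 16) % 16   # +4: sspt_pad length
        plain_tkn = body + xdr_opaque(b"\x00" * pad)

        iv = os.urandom(16)                      # ssct_iv: unpredictable
        enc = Cipher(algorithms.AES(ssv), modes.CBC(iv)).encryptor()
        ssct_encr_data = enc.update(plain_tkn) + enc.finalize()
        ssct_hmac = hmac.new(ssv, plain_tkn, hashlib.sha256).digest()

        # XDR-encoded ssv_seal_cipher_tkn4:
        return (xdr_uint32(ssv_seq) + xdr_opaque(iv) +
                xdr_opaque(ssct_encr_data) + xdr_opaque(ssct_hmac))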
skipping to change at page 67, line 15
replaced without destroying the SSV's GSS contexts.  If for some
reason SSV RPCSEC_GSS handles expire, the EXCHANGE_ID operation can
be used to create more SSV RPCSEC_GSS handles.

The client MUST establish an SSV via SET_SSV before the GSS context
can be used to emit tokens from GSS_Wrap() and GSS_GetMIC().  If
SET_SSV has not been successfully called, attempts to emit tokens
MUST fail.
The SSV mechanism does not support replay detection and sequencing in
its tokens because RPCSEC_GSS does not use those features (Section
5.2.2 "Context Creation Requests" in [5]).
2.10.8.  Session Mechanics - Steady State

2.10.8.1.  Obligations of the Server

The server has the primary obligation to monitor the state of
backchannel resources that the client has created for the server
(RPCSEC_GSS contexts and backchannel connections).  If these
resources vanish, the server takes action as specified in

skipping to change at page 72, line 40
The syntax and semantics to describe the data types of the NFS
version 4 protocol are defined in the XDR RFC4506 [3] and RPC RFC1831
[4] documents.  The next sections build upon the XDR data types to
define types and structures specific to this protocol.

3.1.  Basic Data Types

These are the base NFSv4 data types.
+----------------------+--------------------------------------------+
| Data Type            | Definition                                 |
+----------------------+--------------------------------------------+
| int32_t              | typedef int int32_t;                       |
| uint32_t             | typedef unsigned int uint32_t;             |
| int64_t              | typedef hyper int64_t;                     |
| uint64_t             | typedef unsigned hyper uint64_t;           |
| attrlist4<>          | typedef opaque attrlist4<>;                |
|                      | Used for file/directory attributes         |
| bitmap4<>            | typedef uint32_t bitmap4<>;                |
|                      | Used in attribute array encoding.          |
| changeid4            | typedef uint64_t changeid4;                |
|                      | Used in definition of change_info          |
| clientid4            | typedef uint64_t clientid4;                |
|                      | Shorthand reference to client              |
|                      | identification                             |
| count4               | typedef uint32_t count4;                   |
|                      | Various count parameters (READ, WRITE,     |
|                      | COMMIT)                                    |
| length4              | typedef uint64_t length4;                  |
|                      | Describes LOCK lengths                     |
| mode4                | typedef uint32_t mode4;                    |
|                      | Mode attribute data type                   |
| nfs_cookie4          | typedef uint64_t nfs_cookie4;              |
|                      | Opaque cookie value for READDIR            |
| nfs_fh4<NFS4_FHSIZE> | typedef opaque nfs_fh4<NFS4_FHSIZE>;       |
|                      | Filehandle definition; NFS4_FHSIZE is      |
|                      | defined as 128                             |
| nfs_ftype4           | enum nfs_ftype4;                           |
|                      | Various defined file types                 |
| nfsstat4             | enum nfsstat4;                             |
|                      | Return value for operations                |
| offset4              | typedef uint64_t offset4;                  |
|                      | Various offset designations (READ, WRITE,  |
|                      | LOCK, COMMIT)                              |
| qop4                 | typedef uint32_t qop4;                     |
|                      | Quality of protection designation in       |
|                      | SECINFO                                    |
| sec_oid4<>           | typedef opaque sec_oid4<>;                 |
|                      | Security Object Identifier.  The sec_oid4  |
|                      | data type is not really opaque.  Instead   |
|                      | it contains an ASN.1 OBJECT IDENTIFIER as  |
|                      | used by GSS-API in the mech_type argument  |
|                      | to GSS_Init_sec_context.  See [8] for      |
|                      | details.                                   |
| sequenceid4          | typedef uint32_t sequenceid4;              |
|                      | Sequence number used for various session   |
|                      | operations (EXCHANGE_ID, CREATE_SESSION,   |
|                      | SEQUENCE, CB_SEQUENCE).                    |
| seqid4               | typedef uint32_t seqid4;                   |
|                      | Sequence identifier used for file locking  |
| sessionid4           | typedef opaque sessionid4[16];             |
|                      | Session identifier                         |
| slotid4              | typedef uint32_t slotid4;                  |
|                      | Sequencing artifact for various session    |
|                      | operations (SEQUENCE, CB_SEQUENCE).        |
| utf8string<>         | typedef opaque utf8string<>;               |
|                      | UTF-8 encoding for strings                 |
| utf8str_cis          | typedef utf8string utf8str_cis;            |
|                      | Case-insensitive UTF-8 string              |
| utf8str_cs           | typedef utf8string utf8str_cs;             |
|                      | Case-sensitive UTF-8 string                |
| utf8str_mixed        | typedef utf8string utf8str_mixed;          |
|                      | UTF-8 strings with a case-sensitive prefix |
|                      | and a case-insensitive suffix.             |
| component4           | typedef utf8str_cs component4;             |
|                      | Represents path name components            |
| linktext4            | typedef utf8str_cs linktext4;              |
|                      | Symbolic link contents                     |
| pathname4<>          | typedef component4 pathname4<>;            |
|                      | Represents path name for fs_locations      |
| verifier4            | typedef opaque                             |
|                      | verifier4[NFS4_VERIFIER_SIZE];             |
|                      | Verifier used for various operations       |
|                      | (COMMIT, CREATE, EXCHANGE_ID, OPEN,        |
|                      | READDIR, WRITE).  NFS4_VERIFIER_SIZE is    |
|                      | defined as 8.                              |
+----------------------+--------------------------------------------+
End of Base Data Types

Table 1
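As an aside on Table 1, the bitmap4 type underlies attribute array
encoding: attribute number n occupies bit (n mod 32) of word
(n div 32).  The non-normative sketch below illustrates that packing;
the attribute numbers used (type = 1, fh_expire_type = 2) match the
attribute table later in this document.

    def attrs_to_bitmap4(attr_numbers):
        # Attribute number n sets bit (n % 32) of word (n // 32).
        words = []
        for n in attr_numbers:
            while len(words) <= n // 32:
                words.append(0)
            words[n // 32] |= 1 << (n % 32)
        return words

    # Attributes type (1) and fh_expire_type (2) share one word: 0b110.
    print([hex(w) for w in attrs_to_bitmap4({1, 2})])   # -> ['0x6']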
3.2.  Structured Data Types

3.2.1.  nfstime4

    struct nfstime4 {
        int64_t         seconds;
        uint32_t        nseconds;
    };
The nfstime4 structure gives the number of seconds and nanoseconds
since midnight or 0 hour January 1, 1970 Coordinated Universal Time
(UTC).  Values greater than zero for the seconds field denote dates
after the 0 hour January 1, 1970.  Values less than zero for the
seconds field denote dates before the 0 hour January 1, 1970.  In
both cases, the nseconds field is to be added to the seconds field
for the final time representation.  For example, if the time to be
represented is one-half second before 0 hour January 1, 1970, the
seconds field would have a value of negative one (-1) and the

skipping to change at page 77, line 9
    };

This structure is used with the CREATE, LINK, REMOVE, and RENAME
operations to let the client know the value of the change attribute
for the directory in which the target file system object resides.

3.2.10.  netaddr4
    struct netaddr4 {
        /* see struct rpcb in RFC1833 */
        string          na_r_netid<>;   /* network id */
        string          na_r_addr<>;    /* universal address */
    };

The netaddr4 structure is used to identify TCP/IP based endpoints.
The r_netid and r_addr fields are specified in RFC1833 [26], but they
are underspecified in RFC1833 [26] as far as what they should look
like for specific protocols.

For TCP over IPv4 and for UDP over IPv4, the format of r_addr is the
US-ASCII string:

skipping to change at page 78, line 8
representing an IPv6 address as defined in Section 2.2 of RFC1884
[13].  Additionally, the two alternative forms specified in Section
2.2 of RFC1884 [13] are also acceptable.

For TCP over IPv6 the value of r_netid is the string "tcp6".  For UDP
over IPv6 the value of r_netid is the string "udp6".  That this
document specifies the universal address and netid for UDP/IPv6 does
not imply that UDP/IPv6 is a legal transport for NFSv4.1 (see
Section 2.9).
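For illustration, the sketch below renders an IPv4 endpoint in the
universal address form used by r_addr, where the last two
dot-separated groups are the high-order and low-order octets of the
port.  The address 192.0.2.5 and port 2049 are example values; this
is a non-normative sketch.

    def to_uaddr(ip: str, port: int) -> str:
        # Dotted-quad address followed by the two port octets,
        # high-order octet first ("h1.h2.h3.h4.p1.p2").
        return f"{ip}.{port >> 8}.{port & 0xFF}"

    def from_uaddr(uaddr: str):
        parts = uaddr.rsplit(".", 2)   # split off p1 and p2
        return parts[0], (int(parts[1]) << 8) | int(parts[2])

    print(to_uaddr("192.0.2.5", 2049))   # -> 192.0.2.5.8.1
    print(from_uaddr("192.0.2.5.8.1"))   # -> ('192.0.2.5', 2049)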
3.2.11.  state_owner4

    struct state_owner4 {
        clientid4       clientid;
        opaque          owner<NFS4_OPAQUE_LIMIT>;
    };

    typedef state_owner4 open_owner4;
    typedef state_owner4 lock_owner4;

The state_owner4 data type is the base type for the open_owner4
(Section 3.2.11.1) and lock_owner4 (Section 3.2.11.2).
NFS4_OPAQUE_LIMIT is defined as 1024.

3.2.11.1.  open_owner4

This structure is used to identify the owner of open state.

3.2.11.2.  lock_owner4

This structure is used to identify the owner of file locking state.
3.2.12.  open_to_lock_owner4

    struct open_to_lock_owner4 {
        seqid4          open_seqid;
        stateid4        open_stateid;
        seqid4          lock_seqid;
        lock_owner4     lock_owner;
    };

This structure is used for the first LOCK operation done for an
open_owner4.  It provides both the open_stateid and lock_owner such
that the transition is made from a valid open_stateid sequence to
that of the new lock_stateid sequence.  Using this mechanism avoids
the confirmation of the lock_owner/lock_seqid pair since it is tied
to established state in the form of the open_stateid/open_seqid.
3.2.13.  stateid4

    struct stateid4 {
        uint32_t        seqid;
        opaque          other[12];
    };

This structure is used for the various state sharing mechanisms
between the client and server.  For the client, this data structure
is read-only.  The starting value of the seqid field is undefined.
The server is required to increment the seqid field monotonically at
each transition of the stateid.  This is important since the client
will inspect the seqid in OPEN stateids to determine the order of
OPEN processing done by the server.
3.2.14.  layouttype4

    enum layouttype4 {
        LAYOUT4_NFSV4_1_FILES   = 1,
        LAYOUT4_OSD2_OBJECTS    = 2,
        LAYOUT4_BLOCK_VOLUME    = 3
    };

A layout type specifies the layout being used.  The implication is
that clients have "layout drivers" that support one or more layout
types.  The file server advertises the layout types it supports

skipping to change at page 79, line 40

globally unique and are assigned according to the description in
Section 21.1; they are maintained by IANA.  Types within the range
0x80000000-0xFFFFFFFF are site specific and for "private use" only.

The LAYOUT4_NFSV4_1_FILES enumeration specifies that the NFSv4.1 file
layout type is to be used.  The LAYOUT4_OSD2_OBJECTS enumeration
specifies that the object layout, as defined in [29], is to be used.
Similarly, the LAYOUT4_BLOCK_VOLUME enumeration specifies that the
block/volume layout, as defined in [30], is to be used.
3.2.15.  deviceid4

    typedef uint64_t        deviceid4;

Layout information includes device IDs that specify a storage device
through a compact handle.  Addressing and type information is
obtained with the GETDEVICEINFO operation.  A client must not assume
that device IDs are valid across metadata server reboots.  The device
ID is qualified by the layout type and is unique per file system
(FSID).  See Section 12.2.10 for more details.
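Because a device ID is only meaningful within a given (file system,
layout type) pair and does not survive a metadata server reboot, a
client-side device cache might be keyed and invalidated as in the
hypothetical sketch below; `fetch_device_addr` stands in for a
GETDEVICEINFO round trip, and all names here are invented for
illustration.

    from typing import Dict, Tuple

    DeviceKey = Tuple[int, int, int]   # (fsid, layout_type, deviceid)

    def fetch_device_addr(fsid, layout_type, deviceid) -> bytes:
        # Hypothetical stand-in for a GETDEVICEINFO round trip; a
        # real client would return the opaque device address here.
        return b""

    class DeviceCache:
        def __init__(self) -> None:
            self._cache: Dict[DeviceKey, bytes] = {}

        def lookup(self, fsid, layout_type, deviceid) -> bytes:
            key = (fsid, layout_type, deviceid)
            if key not in self._cache:   # miss: ask the metadata server
                self._cache[key] = fetch_device_addr(fsid, layout_type,
                                                     deviceid)
            return self._cache[key]

        def server_rebooted(self) -> None:
            self._cache.clear()   # IDs are not valid across MDS reboots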
3.2.16.  device_addr4

    struct device_addr4 {
        layouttype4     da_layout_type;
        opaque          da_addr_body<>;
    };

The device address is used to set up a communication channel with the
storage device.  Different layout types will require different types
of structures to define how they communicate with storage devices.
The opaque da_addr_body field must be interpreted based on the
specified da_layout_type field.

This document defines the device address for the NFSv4.1 file layout
([[Comment.7: need xref]]), which identifies a storage device by
network IP address and port number.  This is sufficient for the
clients to communicate with the NFSv4.1 storage devices, and may be
sufficient for other layout types as well.  Device types for object
storage devices and block storage devices (e.g., SCSI volume labels)
will be defined by their respective layout specifications.
3.2.17.  devlist_item4

    struct devlist_item4 {
        deviceid4       dli_id;
        device_addr4    dli_device_addr<>;
    };

An array of these values is returned by the GETDEVICELIST operation.
They define the set of devices associated with a file system for the
layout type specified in the GETDEVICELIST4args.
3.2.18.  layout_content4

    struct layout_content4 {
        layouttype4     loc_type;
        opaque          loc_body<>;
    };

The loc_body field must be interpreted based on the layout type
(loc_type).  The loc_body for the NFSv4.1 file layout type is defined
by this document; see Section 13.3 for its definition.
3.2.19.  layout4

    struct layout4 {
        offset4                 lo_offset;
        length4                 lo_length;
        layoutiomode4           lo_iomode;
        layout_content4         lo_content;
    };

The layout4 structure defines a layout for a file.  The layout type
specific data is opaque within lo_content.  Since layouts are sub-
dividable, the offset and length, together with the file's
filehandle, the client ID, iomode, and layout type, identify the
layout.
3.2.20.  layoutupdate4

    struct layoutupdate4 {
        layouttype4             lou_type;
        opaque                  lou_body<>;
    };

The layoutupdate4 structure is used by the client to return 'updated'
layout information to the metadata server at LAYOUTCOMMIT time.  This
structure provides a channel to pass layout type specific information
(in field lou_body) back to the metadata server.  For example, for
block/volume layout types this could include the list of reserved
blocks that were written.  The contents of the opaque lou_body
argument are determined by the layout type and are defined in their
context.  The NFSv4.1 file-based layout does not use this structure,
thus the lou_body field should have a zero length.
3.2.21.  layouthint4

    struct layouthint4 {
        layouttype4             loh_type;
        opaque                  loh_body<>;
    };

The layouthint4 structure is used by the client to pass in a hint
about the type of layout it would like created for a particular file.
It is the structure specified by the layout_hint attribute described
in Section 5.13.4.  The metadata server may ignore the hint, or may
selectively ignore fields within the hint.  This hint should be
provided at create time as part of the initial attributes within
OPEN.  The loh_body field is specific to the type of layout
(loh_type).  The NFSv4.1 file-based layout uses the
nfsv4_1_file_layouthint4 structure as defined in Section 13.3.
3.2.22.  layoutiomode4

    enum layoutiomode4 {
        LAYOUTIOMODE4_READ      = 1,
        LAYOUTIOMODE4_RW        = 2,
        LAYOUTIOMODE4_ANY       = 3
    };

The iomode specifies whether the client intends to read or write
(with the possibility of reading) the data represented by the layout.
The ANY iomode MUST NOT be used for LAYOUTGET; however, it can be
used for LAYOUTRETURN and LAYOUTRECALL.  The ANY iomode specifies
that layouts pertaining to both READ and RW iomodes are being
returned or recalled, respectively.  The metadata server's use of the
iomode may depend on the layout type being used.  The storage devices
may validate I/O accesses against the iomode and reject invalid
accesses.
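The storage-device check mentioned above can be sketched as follows;
this is a minimal, non-normative illustration of which accesses an
iomode authorizes.

    LAYOUTIOMODE4_READ, LAYOUTIOMODE4_RW, LAYOUTIOMODE4_ANY = 1, 2, 3

    def io_permitted(iomode, is_write):
        # A READ layout does not authorize writes; RW authorizes both.
        # ANY is only meaningful for LAYOUTRETURN/LAYOUTRECALL, so an
        # I/O presenting it is rejected here.
        if iomode == LAYOUTIOMODE4_READ:
            return not is_write
        return iomode == LAYOUTIOMODE4_RW

    assert io_permitted(LAYOUTIOMODE4_RW, is_write=True)
    assert not io_permitted(LAYOUTIOMODE4_READ, is_write=True)
    assert not io_permitted(LAYOUTIOMODE4_ANY, is_write=False)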
3.2.23.  nfs_impl_id4

    struct nfs_impl_id4 {
        utf8str_cis     nii_domain;
        utf8str_cs      nii_name;
        nfstime4        nii_date;
    };

This structure is used to identify client and server implementation
details.  The nii_domain field is the DNS domain name that the
implementer is associated with.  The nii_name field is the product
name of the implementation and is completely free form.  It is
recommended that the nii_name be used to distinguish machine
architecture, machine platforms, revisions, versions, and patch
levels.  The nii_date field is the timestamp of when the software
instance was published or built.
3.2.24.  threshold_item4

    struct threshold_item4 {
        layouttype4     thi_layout_type;
        bitmap4         thi_hintset;
        opaque          thi_hintlist<>;
    };

This structure contains a list of hints specific to a layout type for
helping the client determine when it should issue I/O directly
through the metadata server vs. the data servers.  The hint structure

skipping to change at page 83, line 31
| threshold4_read_iosize  | 2 | length4 | For read I/O sizes below  |
|                         |   |         | this threshold it is      |
|                         |   |         | recommended to read data  |
|                         |   |         | through the MDS           |
| threshold4_write_iosize | 3 | length4 | For write I/O sizes below |
|                         |   |         | this threshold it is      |
|                         |   |         | recommended to write data |
|                         |   |         | through the MDS           |
+-------------------------+---+---------+---------------------------+
3.2.25.  mdsthreshold4

    struct mdsthreshold4 {
        threshold_item4 mth_hints<>;
    };

This structure holds an array of threshold_item4 structures, each of
which is valid for a particular layout type.  An array is necessary
since a server can support multiple layout types for a single file.
4.  Filehandles

skipping to change at page 93, line 21
|                 |    |            |      | would retrieve all   |
|                 |    |            |      | mandatory and        |
|                 |    |            |      | recommended          |
|                 |    |            |      | attributes that are  |
|                 |    |            |      | supported for this   |
|                 |    |            |      | object.  The scope   |
|                 |    |            |      | of this attribute    |
|                 |    |            |      | applies to all       |
|                 |    |            |      | objects with a       |
|                 |    |            |      | matching fsid.       |
| type            | 1  | nfs_ftype4 | READ | The type of the      |
|                 |    |            |      | object (file,        |
|                 |    |            |      | directory, symlink,  |
|                 |    |            |      | etc.)                |
| fh_expire_type  | 2  | uint32     | READ | Server uses this to  |
|                 |    |            |      | specify filehandle   |
|                 |    |            |      | expiration behavior  |
|                 |    |            |      | to the client.  See  |
|                 |    |            |      | the section          |
|                 |    |            |      | "Filehandles" for    |
|                 |    |            |      | additional           |

skipping to change at page 109, line 38
The dirent_notif_delay attribute is the minimum number of seconds the
server will delay before notifying the client of a change to a file
object that has an entry in the directory.

5.13.  PNFS Attributes

5.13.1.  fs_layout_type

The fs_layout_type attribute (data type layouttype4, see
Section 3.2.14) applies to a file system and indicates what layout
types are supported by the file system.  This attribute is expected
to be queried when a client encounters a new fsid.  This attribute is
used by the client to determine if it supports the layout type.
5.13.2.  layout_alignment

The layout_alignment attribute indicates the preferred alignment for
I/O to files on the file system the client has layouts for.  Where
possible, the client should issue READ and WRITE operations with
offsets that are whole multiples of the layout_alignment attribute.

skipping to change at page 110, line 16
The layout_blksize attribute indicates the preferred block size for
I/O to files on the file system the client has layouts for.  Where
possible, the client should issue READ operations with a count
argument that is a whole multiple of layout_blksize, and WRITE
operations with a data argument of size that is a whole multiple of
layout_blksize.
5.13.4.  layout_hint

The layout_hint attribute (data type layouthint4, see Section 3.2.21)
may be set on newly created files to influence the metadata server's
choice for the file's layout.  It is suggested that this attribute be
set as one of the initial attributes within the OPEN call.  The
metadata server may ignore this attribute.  This attribute is a
subset of the layout structure returned by LAYOUTGET.  For example,
instead of specifying particular devices, this would be used to
suggest the stripe width of a file.  It is up to the server
implementation to determine which fields within the layout it uses.
5.13.5.  layout_type

This attribute indicates the particular layout type(s) used for a
file.  This is for informational purposes only.  The client needs to
use the LAYOUTGET operation in order to get enough information (e.g.,
specific device information) to perform I/O.
5.13.6.  mdsthreshold

This attribute is a server-provided hint used to communicate to the
client when it is more efficient to issue read and write requests to
the metadata server or the data server.  The two types of thresholds
described are file size thresholds and I/O size thresholds.  If a
file's size is smaller than the file size threshold, data accesses
should be issued to the metadata server.  If an I/O is below the I/O
size threshold, the I/O should be issued to the metadata server.  As
defined, each threshold type is specified separately for read and
write.

The server may provide both types of thresholds for a file.  If both
file size and I/O size are provided, the client must exceed both
thresholds before issuing its read or write requests to the data
server.  Alternatively, if only one of the specified thresholds is
exceeded, the I/O requests are issued to the metadata server.

For each threshold type, a value of 0 indicates no read or write
should be issued to the metadata server, while a value of all 1s
indicates all reads or writes should be issued to the metadata
server.
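A minimal sketch of the client's decision for one direction (read or
write) follows.  The handling of absent hints via None, the
precedence given to a threshold of 0, and the example hint values are
assumptions of the sketch, not something this attribute specifies.

    def use_metadata_server(file_size, io_size,
                            size_threshold=None, iosize_threshold=None):
        # Decide where one read (or write) should go.  A threshold of
        # 0 is treated as "never use the MDS"; a threshold of all 1s
        # can never be exceeded, so every request goes to the MDS,
        # matching the two special values above.
        for value, threshold in ((file_size, size_threshold),
                                 (io_size, iosize_threshold)):
            if threshold is None:
                continue
            if threshold == 0:
                return False
            if value <= threshold:       # threshold not exceeded
                return True
        return False                     # all provided thresholds exceeded

    # Assumed hints: 1 MiB file-size and 64 KiB I/O-size thresholds.
    print(use_metadata_server(2**21, 2**15, 2**20, 2**16))  # True
    print(use_metadata_server(2**21, 2**17, 2**20, 2**16))  # False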
The attribute is available on a per-filehandle basis.  If the current
filehandle refers to a non-pNFS file or directory, the metadata
server should return an attribute that is representative of the
filehandle's file system.  It is suggested that this attribute is
queried as part of the OPEN operation.  Due to dynamic system
changes, the client should not assume that the attribute will remain
constant for any specific time period, thus it should be periodically
refreshed.

skipping to change at page 143, line 50
associated with the current session.  Stateids apply to all sessions
associated with the given client ID and the client may use a stateid
obtained from one session on another session associated with the same
client ID.

8.1.3.1.  Stateid Structure
Stateids are divided into two fields, a 96-bit "other" field
identifying the specific set of locks and a 32-bit "seqid" sequence
value.  Except in the case of special stateids, to be discussed
below, this specification requires the server to increment the seqid
field by one (1) whenever it returns a stateid with an "other" field
that differs from that of the previous stateid it generated for the
state owner/file combination.  The purpose of incrementing the seqid
is to allow the replier to communicate to the requester the order in
which operations that modified locking state associated with a
stateid have been processed.  This is necessary for the scenario
where the state owner is sending multiple parallel operations with
the same stateid as an argument, or in the case of OPEN, the same
file as an argument.  In this scenario, at least one returned stateid
differs from the other returned stateids.  Without knowing the order
in which the operations executed, the client cannot tell which of the
returned stateids corresponds to the current state of the file/state
owner combination.  This is a problem because subsequent operations
on the same file/state owner combination require the latest stateid
to be used in the arguments.  The visibility of the "seqid" value in
the stateid allows a client to determine which of the returned
stateids is the latest.
In the case of stateids associated with opens, i.e. the stateids
returned by OPEN (the state for the open, rather than that for the
delegation), OPEN_DOWNGRADE, or CLOSE, the server MUST provide a
"seqid" value starting at one for the first use of a given "other"
value and incremented by one with each subsequent operation returning
a stateid.
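The way a client can use the visible "seqid" to order replies can be
sketched as follows; this non-normative illustration ignores seqid
wraparound, which a real client must handle.

    def latest_stateid(stateids):
        # Among stateids that share one "other" field, the largest
        # seqid marks the most recently processed state-changing
        # operation.
        assert len({other for _, other in stateids}) == 1
        return max(stateids, key=lambda s: s[0])

    # Replies to three parallel OPENs for one file/open-owner arrive
    # out of order; the visible seqid identifies the current stateid.
    replies = [(2, b"other-xyz"), (1, b"other-xyz"), (3, b"other-xyz")]
    print(latest_stateid(replies))   # -> (3, b'other-xyz')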
In the case of other sorts of stateids (i.e. stateids associated with
record locks and delegations), the server MAY provide an incrementing

skipping to change at page 145, line 10
specific value of the "seqid" field.

The following combinations of "other" and "seqid" are defined in
NFSv4.1:

o  When "other" and "seqid" are both zero, the stateid is treated as
   a special anonymous stateid, which can be used in READ, WRITE, and
   SETATTR requests to indicate the absence of any open state
   associated with the request.  When an anonymous stateid value is
   used, and an existing open denies the form of access requested,
   then access will be denied to the request.  This stateid MUST NOT
   be used on operations to data servers (Section 13.7), nor may it
   be used as the argument to the WANT_DELEGATION (Section 17.49)
   operation.
o When "other" and "seqid" are both all ones, the stateid is a o When "other" and "seqid" are both all ones, the stateid is a
special read bypass stateid. When this value is used in WRITE or special read bypass stateid. When this value is used in WRITE or
SETATTR, it is treated like the anonymous value. When used in SETATTR, it is treated like the anonymous value. When used in
READ, the server MAY grant access, even if access would normally READ, the server MAY grant access, even if access would normally
be denied to READ requests. be denied to READ requests. This stateid MUST NOT be used on
operations to data servers, nor may it be used as the argument to
the WANT_DELEGATION operation.
o When "other" is zero and "seqid" is one, the stateid represents o When "other" is zero and "seqid" is one, the stateid represents
the current stateid, which is whatever value is the last stateid the current stateid, which is whatever value is the last stateid
returned by an operation within the COMPOUND. In the case of an returned by an operation within the COMPOUND. In the case of an
OPEN, the stateid returned for the open file, and not the OPEN, the stateid returned for the open file, and not the
delegation is used. The stateid passed to the operation in place delegation is used. The stateid passed to the operation in place
of the special value has its "seqid" value set to zero. If there of the special value has its "seqid" value set to zero. If there
is no operation in the COMPOUND which has returned a stateid is no operation in the COMPOUND which has returned a stateid
value, the server MUST return the error NFS4ERR_BAD_STATEID. value, the server MUST return the error NFS4ERR_BAD_STATEID.
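The combinations defined in this list suggest a classification helper
like the non-normative sketch below.

    OTHER_ZERO = b"\x00" * 12   # 96-bit "other" field of all zeros
    OTHER_ONES = b"\xff" * 12   # 96-bit "other" field of all ones

    def classify_stateid(seqid: int, other: bytes) -> str:
        # Sketch of the special "other"/"seqid" combinations above.
        if seqid == 0 and other == OTHER_ZERO:
            return "anonymous"        # no open state associated
        if seqid == 0xFFFFFFFF and other == OTHER_ONES:
            return "read bypass"      # READ may be granted anyway
        if seqid == 1 and other == OTHER_ZERO:
            return "current stateid"  # last stateid in this COMPOUND
        return "normal"

    print(classify_stateid(0, OTHER_ZERO))   # -> anonymous
    print(classify_stateid(1, OTHER_ZERO))   # -> current stateid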
skipping to change at page 210, line 27
were moved to.  At this point the client and server should be back in
sync and the client can resume normal operation.  If it still gets
SEQ4_STATUS_LEASE_MOVED and the state lingers (i.e. another scan of
the filesystems it knows of does not yield new NFS4ERR_MOVED
indications) it can destroy the session to release all of its state
on the server and get back in sync with the server.  It should be
said, however, that destroying the session clears the aforementioned
lease_moved "state" (if it indeed does so).]]  [[Comment.9: Comment
from Trond: What does the error SEQ4_STATUS_LEASE_MOVED mean?  A
lease is supposed to be global to the client, whereas fs_locations
returns information about a specific file system.  What exactly is
the client expected to do if the original server exported 2 file
systems that are now being migrated to 2 different servers?  ( [...]
I still don't see [for example] what particular operation is the
client guaranteed to be able to perform on each file system?)]]
10.6.7.2.  Transitions and the Lease_time Attribute

In order that the client may appropriately manage its leases in the
case of a file system transition, the destination server must
establish proper values for the lease_time attribute.

When state is transferred transparently, that state should include
the correct value of the lease_time attribute.  The lease_time
attribute on the destination server must never be less than that on

skipping to change at page 235, line 45
and notifications.  Thus, the client is required to establish a new
delegation on a server or client reboot.  [[Comment.10: we have
special reclaim types that allow clients to recover delegations
through client reboot.  Do we really want EXCHANGE_ID/CREATE_SESSION
to destroy directory delegation state?]]
12.  Parallel NFS (pNFS)

12.1.  Introduction
pNFS is a set of optional features within NFSv4.1; the pNFS feature
set allows direct client access to the storage devices containing
file data.  When file data for a single NFSv4 server is stored on
multiple and/or higher throughput storage devices (by comparison to
the server's throughput capability), the result can be significantly
better file access performance.  The relationship among multiple
clients, a single server, and multiple storage devices for pNFS
(server and clients have access to all storage devices) is shown in
this diagram:
      +-----------+
      |+-----------+                            +-----------+
      ||+-----------+                           |           |
      |||           |    NFSv4.1 + pNFS         |           |
      +||  Clients  |<------------------------->|  Server   |
       +|           |                           |           |
        +-----------+                           |           |
             |||                                +-----------+
             |||                                     |
             |||                                     |
             ||| Storage       +-----------+         |
             ||| Protocol      |+-----------+        |
             ||+---------------||+-----------+       |
             |+----------------|||           |Control|
             +-----------------+||  Storage  |-------+
                                +|  Devices  |Protocol
                                 +-----------+

                                Figure 65
In this model, the clients, server, and storage devices are jointly
responsible for managing file access.  This is in contrast to NFSv4
without pNFS, where file access coordination is primarily the
server's responsibility; some of this responsibility may be
delegated to clients under strictly specified conditions.
pNFS takes the form of OPTIONAL operations that manage protocol
objects called 'layouts', which contain data location information.
A layout is managed in a fashion similar to an NFSv4.1 data
delegation; for example, the layout is leased, recallable, and
revocable.  However, layouts are distinct abstractions and are
manipulated with new operations.  When a client holds a layout, it
is delegated the ability to access the data directly, using the
location information specified in the layout.
The NFSv4.1 pNFS feature has been structured to allow for a variety
of storage protocols to be defined and used.  As noted in the diagram
above, the storage protocol is the method used by the client to store
and retrieve data directly from the storage devices.  The NFSv4.1
protocol directly defines one storage protocol, the NFSv4.1 storage
type, and its use.

Examples of other storage protocols that could be used with NFSv4.1's
pNFS are:
o  Block/volume protocols such as iSCSI ([35]) and FCP ([36]).  The
   block/volume protocol support can be independent of the addressing
   structure of the block/volume protocol used, allowing more than
   one protocol to access the same file data and enabling
   extensibility to other block/volume protocols.

o  Object protocols such as OSD over iSCSI or Fibre Channel [37].

o  Other storage protocols, including PVFS and other file systems
   that are in use in HPC environments.
Various storage protocols may be available to both client and
server, and it is possible that a given client and server do not
have a matching storage protocol available to them.  Because of
this, the pNFS server MUST support normal NFSv4.1 access to any file
accessible by the pNFS feature; this will allow for continued
interoperability between an NFSv4.1 client and server.
There are interesting interactions between layouts and other NFSv4.1
abstractions such as data delegations and record locking.  Delegation
issues are discussed in Section 12.5.4.  Byte range locking issues
are discussed in Section 12.2.9 and Section 12.5.1.
12.2.  pNFS Definitions
NFSv4.1's pNFS feature partitions the file system protocol into two
parts, metadata and data, where data is the contents of a file and
metadata is "everything else".  The metadata functionality is
implemented by a metadata server that supports pNFS and the
operations described in Section 17.  The data functionality is
implemented by a storage device that supports the storage protocol.
A subset (defined in Section 13.7) of NFSv4.1 itself is one such
storage protocol.  New terms are introduced to the NFSv4.1
nomenclature and existing terms are clarified to allow for the
description of the pNFS feature.
12.2.1.  Metadata
Information about a file system object, such as its name, location
within the namespace, owner, ACL, and other attributes.  Metadata
may also include storage location information, which will vary based
on the underlying storage mechanism that is used.
12.2.2.  Metadata Server
An NFSv4.1 server which supports the pNFS feature.  A variety of
architectural choices exist for the metadata server, particularly
regarding what file system information is held at the server.  Some
servers may contain metadata only for the file objects that reside
at the metadata server, while file data resides on the associated
storage devices.  Other metadata servers may hold both metadata and
a varying degree of file data.
12.2.3.  pNFS Client
An NFSv4.1 client that supports pNFS operations and supports at
least one storage protocol or layout type for performing I/O to
storage devices.
12.2.4.  Storage Device
A storage device stores a regular file's data, but leaves metadata
management to the metadata server.  A storage device could be
another NFSv4.1 server, an object storage device (OSD), a block
device accessed over a SAN (e.g., either a Fibre Channel or iSCSI
SAN), or some other entity.
12.2.5.  Storage Protocol

A storage protocol is the protocol used between the pNFS client and
the storage device to access the file data.

12.2.6.  Control Protocol

The control protocol is used by the exported file system between the
metadata server and storage devices.  Specification of such protocols
is outside the scope of the NFSv4.1 protocol.  Such control protocols
would be used to control activities such as the allocation and
deallocation of storage and the management of state required by the
storage devices to perform client access control.
A particular control protocol is not mandated by NFSv4.1, but
requirements are placed on the control protocol for maintaining
attributes like modify time, the change attribute, and the end-of-
file (EOF) position.
12.2.7.  Layout Types
A layout describes the mapping of a file's data to the storage
devices that hold the data.  A layout is said to belong to a specific
layout type (data type layouttype4, see Section 3.2.14).  The layout
type allows for variants to handle different storage protocols, such
as those associated with block/volume [30], object [29], and file
(Section 13) layout types.  A metadata server, along with its control
protocol, MUST support at least one layout type.  A private sub-range
of the layout type name space is also defined.  Values from the
private layout type range MAY be used for internal testing or
experimentation.
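
As a point of reference, the layout type is conveyed on the wire as
an XDR enumeration.  The sketch below follows the NFSv4.1 XDR as
published; the exact names and values are governed by the
layouttype4 definition referenced above (Section 3.2.14), so treat
this as illustrative rather than normative:

   enum layouttype4 {
           LAYOUT4_NFSV4_1_FILES   = 0x1,  /* file layout (Section 13) */
           LAYOUT4_OSD2_OBJECTS    = 0x2,  /* object layout [29]       */
           LAYOUT4_BLOCK_VOLUME    = 0x3   /* block/volume layout [30] */
   };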
As an example, a file layout type could be an array of tuples (e.g.,
deviceID, file_handle), along with a definition of how the data is
stored across the devices (e.g., striping).  A block/volume layout
might be an array of tuples that store <deviceID, block_number, block
count> along with information about block size and the associated
file offset of the block number.  An object layout might be an array
of tuples <deviceID, objectID> and an additional structure (i.e., the
aggregation map) that defines how the logical octet sequence of the
file data is serialized into the different objects.  Note that the
actual layouts are typically more complex than these simple
expository examples.
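
To make the file layout example above concrete, a hypothetical
striped file layout could be expressed in XDR along the following
lines.  All of the names below (example_file_layout4 and its fields)
are invented purely for illustration; the normative file layout
structures are those defined in Section 13:

   /* Hypothetical sketch, for illustration only; see Section 13
    * for the normative file layout definitions.                 */
   struct example_file_layout_entry4 {
           deviceid4   efle_deviceid;  /* device holding this stripe */
           nfs_fh4     efle_fh;        /* filehandle on that device  */
   };

   struct example_file_layout4 {
           uint32_t                    efl_stripe_unit; /* octets per
                                                           stripe     */
           example_file_layout_entry4  efl_entries<>;   /* striping
                                                           order      */
   };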
12.2.8.  Layout

A layout defines how a file's data is organized on one or more
storage devices.  There are many potential layout types; each layout
type is differentiated by the storage protocol used to access data
and by the aggregation scheme that lays out the file data on the
underlying storage devices.  A layout is precisely identified by the
following tuple: <client ID, filehandle, layout type, iomode,
range>, where filehandle refers to the filehandle of the file on the
metadata server.
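
On the wire, a layout couples this identifying information with an
opaque, layout type-specific body.  The sketch below follows the
published NFSv4.1 XDR (the loc_body and lo_content names also appear
in Section 12.5.3.3); the draft's own XDR definitions govern:

   struct layout_content4 {
           layouttype4     loc_type;    /* which layout type        */
           opaque          loc_body<>;  /* type-specific layout data */
   };

   struct layout4 {
           offset4         lo_offset;   /* start of the octet range */
           length4         lo_length;   /* length of the range      */
           layoutiomode4   lo_iomode;   /* READ or READ/WRITE       */
           layout_content4 lo_content;  /* opaque layout body       */
   };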
It is important to define when layouts overlap and/or conflict with
each other. For two layouts with overlapping octet ranges to
actually overlap each other, both layouts must be of the same layout
type, correspond to the same filehandle, and have the same iomode.
Layouts conflict when they overlap and differ in the content of the
layout (i.e., the storage device/file mapping parameters differ).
Note that differing iomodes do not lead to conflicting layouts. It
is permissible for layouts with different iomodes, pertaining to the
same octet range, to be held by the same client. An example of this
would be copy-on-write functionality for a block/volume layout type.
12.2.9. Layout Iomode
The layout iomode (data type layoutiomode4, see Section 3.2.22)
indicates to the metadata server the client's intent to perform
either just READ operations (Section 17.22) or a mixture of I/O
possibly containing WRITE (Section 17.32) and READ operations.  For
certain layout types, it is useful for a client to specify this
intent at LAYOUTGET (Section 17.43) time.  For example, for block/
volume based protocols, block allocation could occur when a READ/
WRITE iomode is specified.  A special LAYOUTIOMODE4_ANY iomode is
defined and can only be used for LAYOUTRETURN and CB_LAYOUTRECALL,
not for LAYOUTGET.  It specifies that layouts pertaining to both
READ and READ/WRITE iomodes are being returned or recalled,
respectively.
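
For reference, the iomode is an XDR enumeration along the following
lines (after the published NFSv4.1 XDR; the definition in
Section 3.2.22 governs):

   enum layoutiomode4 {
           LAYOUTIOMODE4_READ  = 1,  /* read-only access intended  */
           LAYOUTIOMODE4_RW    = 2,  /* read/write access intended */
           LAYOUTIOMODE4_ANY   = 3   /* LAYOUTRETURN and
                                        CB_LAYOUTRECALL only:
                                        matches any iomode         */
   };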
A storage device may validate I/O with regard to the iomode; this is
dependent upon storage device implementation and layout type.  Thus,
if the client's layout iomode is inconsistent with the I/O being
performed, the storage device may reject the client's I/O with an
error indicating that a new layout with the correct I/O mode should
be fetched.  For example, if a client gets a layout with a READ
iomode and performs a WRITE to a storage device, the storage device
is allowed to reject that WRITE.
The iomode does not conflict with OPEN share modes or lock requests;
open mode and lock conflicts are enforced as they are without the use
of pNFS, and are logically separate from the pNFS layout level.  As
well, open modes and locks are the preferred method for restricting
user access to data files.  For example, an OPEN of read, deny-write
does not conflict with a LAYOUTGET containing an iomode of READ/WRITE
performed by another client.  Applications that depend on writing
into the same file concurrently may use record locking to serialize
their accesses.
12.2.10.  Device IDs
The device ID (data type deviceid4, see Section 3.2.15) names a
storage device.  In practice, a significant amount of information may
be required to fully address a storage device.  Rather than embedding
all such information in a layout, layouts embed device IDs.  The
NFSv4.1 operation GETDEVICEINFO (Section 17.40) is used to retrieve
the complete address information regarding the storage device
according to its layout type and device ID.  For example, the address
of an NFSv4.1 data server or of an object storage device could be an
IP address and port.  The address of a block storage device could be
a volume label.

The device ID is qualified by the layout type and is unique per file
system identifier (FSID, see Section 3.2.5).
Clients cannot expect the mapping between device ID and storage
device address to persist across metadata server restart.  See
Section 12.7.4 for a description of how recovery works in that
situation.
12.3.  pNFS Operations
NFSv4.1 has several operations that are needed for pNFS servers,
regardless of layout type or storage protocol.  These operations are
all issued to a metadata server and are summarized here.  Even though
pNFS is an OPTIONAL feature of NFSv4.1, if a server supports the
pNFS feature, it MUST support all of the pNFS operations.
GETDEVICEINFO (Section 17.40), as noted previously
   (Section 12.2.10), returns the mapping of device ID to storage
   device address.

GETDEVICELIST (Section 17.41) allows clients to fetch all of the
   mappings of device IDs to storage device addresses for a specific
   file system.

LAYOUTGET (Section 17.43) is used by a client to get a layout for a
   file.

LAYOUTCOMMIT (Section 17.42) is used to inform the metadata server
   of the client's intent to commit data which has been written to
   the storage device; the storage device is as originally indicated
   in the return value of LAYOUTGET.

LAYOUTRETURN (Section 17.44) is used to return layouts for a file,
   for an FSID, or for a client ID.
The following pNFS-related callback operations may be issued by a
metadata server to a pNFS client.  The pNFS client MUST implement
these callback operations, and a pNFS server MAY use them as
appropriate.  Note that recalls can be expected for the various
layout types, such as object and block/volume layouts.
CB_LAYOUTRECALL (Section 19.3) recalls a layout, all layouts
   belonging to a file system, or all layouts belonging to a client
   ID (see the sketch following this list).
CB_RECALL_ANY (Section 19.6) tells a client that it needs to return
   some number of recallable objects, including layouts, to the
   metadata server.
CB_RECALLABLE_OBJ_AVAIL (Section 19.7) tells a client that a
   recallable object it was previously denied (in the case of pNFS, a
   layout denied by LAYOUTGET) due to resource exhaustion is now
   available.
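
The three recall scopes of CB_LAYOUTRECALL mentioned above map
naturally onto a discriminated union.  The sketch below follows the
published NFSv4.1 XDR; the names and exact structure in this draft's
Section 19.3 govern:

   /* After the published NFSv4.1 XDR; illustrative only. */
   union layoutrecall4 switch (layoutrecall_type4 lor_recalltype) {
   case LAYOUTRECALL4_FILE:
           layoutrecall_file4  lor_layout; /* one file's octet range */
   case LAYOUTRECALL4_FSID:
           fsid4               lor_fsid;   /* all layouts in an FSID */
   case LAYOUTRECALL4_ALL:
           void;                           /* all of the client's
                                              layouts                */
   };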
12.4.  pNFS Attributes

A number of attributes specific to pNFS are listed and described in
Section 5.13.
12.5.  Layout Semantics

12.5.1.  Guarantees Provided by Layouts
Layouts delegate to the client the ability to access data located at
a storage device with the appropriate storage protocol.  The client
is guaranteed the layout will be recalled when one of two things
occurs: either a conflicting layout is requested, or the state
encapsulated by the layout becomes invalid; the latter can happen
when an event directly or indirectly modifies the layout.  When a
layout is recalled and returned by the client, the client continues
with the ability to access file data with normal NFSv4.1 operations
through the metadata server.  Only the ability to access the storage
devices directly is affected.
The requirement of NFSv4.1 that all user access rights MUST be
obtained through the appropriate open, lock, and access operations
is not modified by the existence of layouts.  Layouts are provided
to NFSv4.1 clients, and user access still follows the rules of the
protocol as if layouts did not exist.  However, for a client to
access a storage device, a layout must be held by the client.  If a
storage device receives an I/O for an octet range for which the
client does not hold a layout, the storage device SHOULD reject that
I/O request.  Note that the act of modifying a file for which a
layout is held does not necessarily conflict with the holding of the
layout that describes the file being modified.  Therefore, it is the
storage protocol or layout type that determines the necessary
behavior.  For example, block/volume layout types require that the
layout's iomode agree with the type of I/O being performed.
Depending upon the layout type and storage protocol in use, storage
device access permissions may be granted by LAYOUTGET and may be
encoded within the type-specific layout.  For an example of storage
device access permissions, see an object-based protocol such as
[37].  If access permissions are encoded within the layout, the
metadata server SHOULD recall the layout when those permissions
become invalid for any reason; for example, when a file becomes
unwritable or inaccessible to a client.  Note that clients are still
required to perform the appropriate access operations (open, lock,
and access) as described above.  The degree to which it is possible
for the client to circumvent these access operations, and the
consequences of doing so, must be clearly specified by the
individual layout type specifications.  In addition, these
specifications must be clear about the requirements and non-
requirements for the checking performed by the server.
In the presence of pNFS functionality, mandatory file locks MUST
behave as they would without pNFS.  Therefore, if mandatory file
locks and layouts are provided simultaneously, the storage device
MUST be able to enforce the mandatory file locks.  For example, if
one client obtains a mandatory lock and a second client accesses the
storage device, the storage device MUST appropriately restrict I/O
for the byte range of the mandatory file lock.  If the storage
device is incapable of providing this check in the presence of
mandatory file locks, then the metadata server MUST NOT grant
layouts and mandatory file locks simultaneously.
12.5.2.  Getting a Layout
A client obtains a layout with the LAYOUTGET operation.  The
metadata server will grant layouts of a particular type (e.g.,
block/volume, object, or file).  The client selects an appropriate
layout type that the server supports and the client is prepared to
use.  The layout returned to the client may not exactly align with
the requested octet range.  A field within the LAYOUTGET request,
loga_minlength, specifies the minimum overlap that MUST exist
between the requested layout and the layout returned by the metadata
server.  The loga_minlength field should be at least one.  As
needed, a client may make multiple LAYOUTGET requests; these will
result in multiple overlapping, non-conflicting layouts.
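
For orientation, the LAYOUTGET request carries the desired layout
type, iomode, and octet range alongside loga_minlength.  The sketch
below follows the published NFSv4.1 XDR and may differ in detail
from this draft's Section 17.43, which governs:

   /* After the published NFSv4.1 XDR; illustrative only. */
   struct LAYOUTGET4args {
           /* CURRENT_FH: file */
           bool            loga_signal_layout_avail;
           layouttype4     loga_layout_type;  /* desired layout type */
           layoutiomode4   loga_iomode;       /* READ or READ/WRITE  */
           offset4         loga_offset;       /* start of range      */
           length4         loga_length;       /* length of range     */
           length4         loga_minlength;    /* minimum overlap,
                                                 at least one        */
           stateid4        loga_stateid;
           count4          loga_maxcount;     /* reply size limit    */
   };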
There is no required ordering between getting a layout and performing
a file OPEN.  For example, a layout may first be retrieved by placing
a LAYOUTGET operation in the same COMPOUND as the initial file OPEN.
Once the layout has been retrieved, it can be held across multiple
OPEN and CLOSE sequences.  Therefore, a client may hold a layout for
a file that is not currently open by any user on the client.  This
allows for the caching of layouts beyond CLOSE such that CLOSE to
OPEN sequences do not incur additional layout overhead.
The storage protocol used by the client to access the data on the
storage device is determined by the layout's type.  The client is
responsible for matching the layout type with an available method to
interpret and use the layout.  The method for this layout type
selection is outside the scope of the pNFS functionality.
Although the metadata server is in control of the layout for a file,
the pNFS client can provide hints to the server when a file is opened
or created about the preferred layout type and aggregation schemes.
pNFS introduces a layout_hint (Section 5.13.4) attribute that the
client can set at file creation time to provide a hint to the server
for new files.  Setting this attribute separately, after the file has
been created, might make it difficult, or impossible, for the server
implementation to comply.  This further complicates the exclusive
file creation via OPEN, which when done via the EXCLUSIVE4 createmode
does not allow the setting of attributes at file creation time.
However, as noted in Section 17.16.4, if the server supports a
persistent reply cache, the EXCLUSIVE4 createmode is not needed.
Therefore, a metadata server that supports the layout_hint attribute
MUST support a persistent session reply cache, and a pNFS client that
wants to set layout_hint at file creation (OPEN) time MUST NOT use
the EXCLUSIVE4 createmode, and instead MUST use GUARDED for an
exclusive regular file creation.
12.5.3.  Committing a Layout
Allowing for varying storage protocol capabilities, the pNFS
protocol does not require the metadata server and storage devices to
have a consistent view of file attributes and data location mappings.
Data location mapping refers to things like which offsets store data
as opposed to storing holes (see Section 13.5 for a discussion).
Related issues arise for storage protocols where a layout may hold
provisionally allocated blocks whose allocation does not survive a
complete restart of both the client and server.  Because of these
potential inconsistencies, it is necessary to re-synchronize the
client with the metadata server and its storage devices, and to make
any resulting changes visible to other clients.  This is accomplished
by use of the LAYOUTCOMMIT operation.
The LAYOUTCOMMIT operation is responsible for committing a modified
layout to the metadata server.  The data should be written and
committed to the appropriate storage devices before the LAYOUTCOMMIT
occurs.  If the data is being written asynchronously through the
metadata server, a COMMIT to the metadata server is required to
synchronize the data and make it visible on the storage devices (see
Section 12.5.5 for more details).  The scope of the LAYOUTCOMMIT
operation depends on the storage protocol in use.  It is important to
note that the level of synchronization is from the point of view of
the client which issued the LAYOUTCOMMIT.  The updated state on the
metadata server need only reflect the state as of the client's last
operation previous to the LAYOUTCOMMIT.  The metadata server is not
REQUIRED to maintain a global view that accounts for other clients'
I/O that may have occurred within the same time frame.

For block/volume-based layouts, the LAYOUTCOMMIT may require updating
the block list that comprises the file and committing this layout to
stable storage.  For file layouts, synchronization of attributes
between the metadata server and storage devices, primarily the size
attribute, is required.
The control protocol is free to synchronize the attributes before it
receives a LAYOUTCOMMIT; however, upon successful completion of a
LAYOUTCOMMIT, state that exists on the metadata server that describes
the file MUST be in sync with the state existing on the storage
devices that comprise that file as of the issuing client's last
operation.  Thus, a client that queries the size of a file between a
WRITE to a storage device and the LAYOUTCOMMIT may observe a size
that does not reflect the actual data written.
12.5.3.1.  LAYOUTCOMMIT and change/time_modify/time_access
The change, time_modify, and time_access attributes may be updated
by the server when the LAYOUTCOMMIT operation is processed.  The
reason for this is that some layout types do not support the update
of these attributes when the storage devices process I/O operations.
The client is capable of providing suggested values to the server
for time_access and time_modify with the arguments to LAYOUTCOMMIT.
Based on layout type, the provided values may or may not be used.
The server should sanity-check the client-provided values before
they are used.  For example, the server should ensure that time does
not flow backwards.  The client always has the option to set these
attributes through an explicit SETATTR operation.
For some layout protocols, the storage device is able to notify the
metadata server of the occurrence of an I/O, and as a result the
change, time_modify, and time_access attributes may be updated at
the metadata server.  For a metadata server that is capable of
monitoring updates to the change, time_modify, and time_access
attributes, LAYOUTCOMMIT processing is not required to update the
change attribute; in this case, the metadata server must ensure that
no further update to the data has occurred since the last update of
the attributes.  File-based protocols may have enough information to
make this determination, or they may update the change attribute
upon each file modification.  This also applies to the time_modify
and time_access attributes.  If the server implementation is able to
determine that the file has not been modified since the last
time_modify update, the server need not update time_modify at
LAYOUTCOMMIT.  At LAYOUTCOMMIT completion, the updated attributes
should be visible if that file was
modified since the latest previous LAYOUTCOMMIT or LAYOUTGET.
12.5.3.2.  LAYOUTCOMMIT and size
The size of a file may be updated when the LAYOUTCOMMIT operation is
used by the client.  One of the fields in the argument to
LAYOUTCOMMIT is loca_last_write_offset; this field indicates the
highest octet offset written but not yet committed with the
LAYOUTCOMMIT operation.  The data type of loca_last_write_offset is
newoffset4 and is switched on a boolean value, no_newoffset, that
indicates whether a previous write occurred or not.  If no_newoffset
is FALSE, an offset is not given.  A loca_last_write_offset value of
zero means that one octet was written at offset zero.
The metadata server may do one of the following:
1.  Update the file's size using the last write offset provided by
    the client, either as the true file size or as a hint of the
    file size.  If the metadata server has a method available, any
    new value for file size should be sanity-checked.  For example,
    the file must not be truncated if the client presents a last
    write offset less than the file's current size.

2.  Ignore the client-provided last write offset; the metadata
    server must have sufficient knowledge from other sources to
    determine the file's size.  For example, the metadata server may
    query the storage devices with the control protocol.
The method chosen to update the file's size will depend on the
storage device's and/or the control protocol's capabilities.  For
example, if the storage devices are block devices with no knowledge
of file size, the metadata server must rely on the client to set the
last write offset appropriately.

The results of LAYOUTCOMMIT contain a new size value in the form of
a newsize4 union data type.  If the file's size is set as a result
of LAYOUTCOMMIT, the metadata server must reply with the new size;
otherwise, the new size is not provided.  If the file size is
updated, the metadata server SHOULD update the storage devices such
that the new file size is reflected when LAYOUTCOMMIT processing is
complete.  For example, the client should be able to READ up to the
new file size.
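
The newsize4 union mirrors newoffset4; it is sketched here after the
published NFSv4.1 XDR (the draft's own definition governs):

   union newsize4 switch (bool ns_sizechanged) {
   case TRUE:
           length4     ns_size;     /* the file's new size        */
   case FALSE:
           void;                    /* the size was not changed   */
   };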
If the client wants to explicitly zero-extend or truncate a file,
the SETATTR operation MUST be used; SETATTR use is not required when
simply writing past EOF via WRITE.
12.5.3.3.  LAYOUTCOMMIT and layoutupdate
The LAYOUTCOMMIT argument contains a loca_layoutupdate field
(Section 17.42.2) of data type layoutupdate4 (Section 3.2.20).  This
argument is a layout type-specific structure.  The structure can be
used to pass arbitrary layout type-specific information from the
client to the metadata server at LAYOUTCOMMIT time.  For example, if
using a block/volume layout, the client can indicate to the metadata
server which reserved or allocated blocks the client used or did not
use.  The content of loca_layoutupdate (field lou_body) need not be
the same as the layout type-specific content returned by LAYOUTGET
(Section 17.43.3) in the loc_body field of the lo_content field of
the logr_layout field.  The content of loca_layoutupdate is defined
by the layout type specification and is opaque to LAYOUTCOMMIT.
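
For reference, layoutupdate4 pairs a layout type with an opaque
body, roughly as follows (after the published NFSv4.1 XDR; the
definition in Section 3.2.20 governs):

   struct layoutupdate4 {
           layouttype4     lou_type;    /* layout type that defines
                                           the body's interpretation */
           opaque          lou_body<>;  /* opaque to LAYOUTCOMMIT
                                           itself                    */
   };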
12.5.4.  Recalling a Layout
Since a layout protects a client's access to a file via a direct
client-storage-device path, a layout need only be recalled when it is
semantically unable to serve this function.  Typically, this occurs
when the layout no longer encapsulates the true location of the file
over the octet range it represents.  Any operation or action, such as
server driven restriping or load balancing, that changes the layout
will result in a recall of the layout.  A layout is recalled by the
CB_LAYOUTRECALL callback operation (see Section 19.3) and returned
with LAYOUTRETURN (Section 17.44).  The CB_LAYOUTRECALL operation may
recall a layout identified by an octet range, all the layouts
associated with a file system (FSID), or all layouts associated with
a client ID.  Recalling all layouts associated with a client ID or
all the layouts associated with a file system also invalidates the
client's device cache for the affected file systems.
Section 12.5.4.2 discusses sequencing issues surrounding the getting,
returning, and recalling of layouts.
An iomode is also specified when recalling a layout.  Generally, the
iomode in the recall request must match the layout being returned;
for example, a recall with an iomode of LAYOUTIOMODE4_RW should cause
the client to return only LAYOUTIOMODE4_RW layouts and not
LAYOUTIOMODE4_READ layouts.  However, a special LAYOUTIOMODE4_ANY
enumeration is defined to enable recalling a layout of any iomode; in
other words, the client must return both read-only and read/write
layouts.
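
As a non-normative illustration, the C sketch below shows the iomode
matching rule a client might apply when deciding which held layouts a
recall obliges it to return; the helper name is hypothetical, though
the enumeration mirrors layoutiomode4.

   /* Illustrative sketch only.  The enum mirrors layoutiomode4;
    * layout_matches_recall() is a hypothetical client helper. */
   enum layoutiomode4 {
           LAYOUTIOMODE4_READ = 1,
           LAYOUTIOMODE4_RW   = 2,
           LAYOUTIOMODE4_ANY  = 3
   };

   int layout_matches_recall(enum layoutiomode4 held,
                             enum layoutiomode4 recalled)
   {
           /* LAYOUTIOMODE4_ANY recalls both read-only and
            * read/write layouts. */
           if (recalled == LAYOUTIOMODE4_ANY)
                   return 1;
           /* Otherwise the iomode of the recall must match the
            * iomode of the layout being returned. */
           return held == recalled;
   }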
A REMOVE operation SHOULD cause the metadata server to recall the
layout to prevent the client from accessing a non-existent file and
to reclaim state stored on the client.  Since a REMOVE may be delayed
until the last close of the file has occurred, the recall may also be
delayed until this time.  After the last reference on the file has
been released and the file has been removed, the client should no
longer be able to perform I/O using the layout.  In the case of a
files-based layout, the pNFS server SHOULD return NFS4ERR_STALE for
the removed file.
Once a layout has been returned, the client MUST NOT issue I/Os to
the storage devices for the file, octet range, and iomode represented
by the returned layout.  If a client does issue an I/O to a storage
device for which it does not hold a layout, the storage device SHOULD
reject the I/O.
Although pNFS does not alter the file data caching capabilities of
clients, or their semantics, it recognizes that some clients may
perform more aggressive write-behind caching to optimize the benefits
provided by pNFS.  However, write-behind caching may negatively
affect the latency in returning a layout in response to a
CB_LAYOUTRECALL; this is similar to file delegations and the impact
that file data caching has on DELEGRETURN.  Client implementations
SHOULD limit the amount of unwritten data they have outstanding at
any one time in order to prevent excessively long responses to
CB_LAYOUTRECALL.  Once a layout is recalled, a server MUST wait one
lease period before taking further action.  As soon as a lease period
has passed, the server may choose to fence the client's access to the
storage devices if the server perceives that the client has taken too
long to return a layout.  However, just as in the case of data
delegation and DELEGRETURN, the server may choose to wait, given that
the client is showing forward progress on its way to returning the
layout.  This forward progress can take the form of successful
interaction with the storage devices or of sub-portions of the layout
being returned by the client.  The server can also limit exposure to
these problems by limiting the octet ranges initially provided in the
layouts and thus the amount of outstanding modified data.
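
A minimal server-side sketch of the fencing decision just described,
assuming hypothetical per-recall bookkeeping fields (recall_time,
last_progress_time); "progress" stands for observed storage device
I/O or partial LAYOUTRETURNs from the client:

   /* Illustrative server-side sketch.  All names are hypothetical. */
   #include <time.h>

   struct recalled_layout {
           time_t recall_time;         /* when CB_LAYOUTRECALL was sent */
           time_t last_progress_time;  /* last partial return or I/O seen */
   };

   int may_fence_client(const struct recalled_layout *rl,
                        time_t now, time_t lease_period)
   {
           /* The server MUST wait at least one lease period ... */
           if (now - rl->recall_time < lease_period)
                   return 0;
           /* ... and may still choose to wait while the client shows
            * forward progress toward returning the layout. */
           if (now - rl->last_progress_time < lease_period)
                   return 0;
           return 1;   /* fencing is now a permissible choice */
   }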
12.5.4.1.  Recall Callback Robustness
It has been assumed thus far that pNFS client state for a file
exactly matches the pNFS server state for that file and client
regarding layout ranges and iomode.  This assumption leads to the
implication that any callback results in a LAYOUTRETURN or set of
LAYOUTRETURNs that exactly match the range in the callback, since
both client and server agree about the state being maintained.
However, it can be useful if this assumption does not always hold.
For example:
o  It may be useful for clients to be able to discard layout
   information without calling LAYOUTRETURN.  If conflicts that
   require callbacks are very rare, and a server can use a multi-file
   callback to recover per-client resources (e.g., via a FSID recall,
   or a multi-file recall within a single compound), the result may
   be significantly less client-server pNFS traffic.

o  It may be similarly useful for servers to maintain information
   about what ranges are held by a client on a coarse-grained basis,
   leading to the server's layout ranges being beyond those actually
   held by the client.  In the extreme, a server could manage
   conflicts on a per-file basis, only issuing whole-file callbacks
   even though clients may request and be granted sub-file ranges.
o  In order to avoid errors, it is vital that a client not assign
   itself layout permissions beyond what the server has granted and
   that the server not forget layout permissions that have been
   granted.  On the other hand, if a server believes that a client
   holds a layout that the client does not know about, it is useful
   for the client to cleanly indicate completion of the requested
   recall either by issuing a LAYOUTRETURN for the entire requested
   range or by returning an NFS4ERR_NOMATCHING_LAYOUT error to the
   CB_LAYOUTRECALL.
Thus, in light of the above, it is useful for a server to be able to
issue callbacks for layout ranges it has not granted to a client, and
for a client to return ranges it does not hold.  A pNFS client MUST
always return layouts that comprise the full range specified by the
recall.  Note, the full recalled layout range need not be returned as
part of a single operation, but may be returned in portions.  This
allows the client to stage the flushing of dirty data, layout
commits, and returns.  Also, it indicates to the metadata server that
the client is making progress.
When a layout is returned, the client MUST NOT have any outstanding
write requests to the storage devices involved in the layout.
Rephrasing, the client MUST NOT return the layout while it has
outstanding write requests to the storage device.
Even with this requirement for the client, it is possible that write
requests may be presented to a storage device no longer allowed to
perform them.  Since the server has no strict control as to when the
client will return the layout, the server may later decide to
unilaterally revoke the client's access provided by the layout.  In
choosing to revoke access, the server must deal with the possibility
of lingering writes: outstanding writes still in flight to storage
devices identified by the revoked layout.  Each layout specification
MUST define whether unilateral layout revocation by the metadata
server is supported; if it is, the specification must also describe
how lingering writes are processed.  For example, storage devices
identified by the revoked layout could be fenced off from the client
that held the layout.
In order to ensure client/server convergence with regard to layout
state, the final LAYOUTRETURN operation in a sequence of LAYOUTRETURN
operations for a particular recall MUST specify the entire range
being recalled, echoing the recalled layout type, iomode, recall/
return type (FILE, FSID, or ALL), and octet range, even if layouts
pertaining to partial ranges were previously returned.  In addition,
if the client holds no layouts that overlap the range being recalled,
the client should return the NFS4ERR_NOMATCHING_LAYOUT error code to
CB_LAYOUTRECALL.  This allows the server to update its view of the
client's layout state.
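
The client-side rules above might be realized as in the following
non-normative sketch; the layout cache and RPC helpers
(client_holds_overlap, flush_and_commit, send_layoutreturn) are
hypothetical, and the error values mirror this document's error
definitions:

   /* Illustrative client sketch; all helpers are hypothetical. */
   #define NFS4_OK                       0
   #define NFS4ERR_NOMATCHING_LAYOUT 10060

   struct layout_range { unsigned long long offset, length; };

   extern int  client_holds_overlap(const struct layout_range *r);
   extern void flush_and_commit(const struct layout_range *r);
   extern void send_layoutreturn(const struct layout_range *r);

   int handle_cb_layoutrecall(const struct layout_range *recalled)
   {
           if (!client_holds_overlap(recalled))
                   /* No overlapping layout: tell the server so it can
                    * update its view of the client's layout state. */
                   return NFS4ERR_NOMATCHING_LAYOUT;

           /* Dirty data may be flushed and sub-ranges returned in
            * portions, but the final LAYOUTRETURN must specify the
            * entire recalled range. */
           flush_and_commit(recalled);
           send_layoutreturn(recalled);
           return NFS4_OK;
   }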
12.5.4.2.  Serialization of Layout Operations
As with other stateful operations, pNFS requires the correct
sequencing of layout operations.  pNFS uses the sessions feature of
NFSv4.1 to provide the correct sequencing between regular operations
and callbacks.  It is the server's responsibility to avoid
inconsistencies regarding the layouts provided and the client's
responsibility to properly serialize its layout requests and layout
returns.
12.5.4.2.1.  Get/Return Serialization
The protocol allows the client to send concurrent LAYOUTGET and
LAYOUTRETURN operations to the server.  However, the protocol does
not provide any means for the server to process the requests in the
same order in which they were created, nor does it provide a way for
the client to determine the order in which parallel outstanding
operations were processed by the server.  Thus, when a layout
retrieved by an outstanding LAYOUTGET operation intersects with a
layout returned by an outstanding LAYOUTRETURN, the order in which
the two conflicting operations are processed determines the final
state of the overlapping layout.  To disambiguate between the two
cases, the client MUST serialize LAYOUTGET operations and voluntary
LAYOUTRETURN operations for the same file.
It is permissible for the client to send in parallel multiple
LAYOUTGET operations for the same file or multiple LAYOUTRETURN
operations for the same file, but never a mix of both.  It is also
permissible for the client to combine LAYOUTRETURN and LAYOUTGET
operations for the same file in the same COMPOUND request, since the
server MUST process these in order.  If a client does issue such
requests, it MUST NOT have more than one outstanding for the same
file at the same time and MUST NOT have other LAYOUTGET or
LAYOUTRETURN operations outstanding at the same time for that same
file.
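
A sketch of the per-file gating a client might use to honor these
serialization rules; the bookkeeping structure and counters are
hypothetical:

   /* Illustrative sketch: multiple LAYOUTGETs or multiple
    * LAYOUTRETURNs may be in flight for a file, but never a mix. */
   struct file_layout_io {
           int layoutgets_outstanding;
           int layoutreturns_outstanding;
   };

   int may_send_layoutget(const struct file_layout_io *f)
   {
           return f->layoutreturns_outstanding == 0;
   }

   int may_send_layoutreturn(const struct file_layout_io *f)
   {
           return f->layoutgets_outstanding == 0;
   }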
12.5.4.2.2.  Recall/Return Sequencing
One critical issue with regard to operation sequencing concerns
callbacks.  The protocol must defend against races between the reply
to a LAYOUTGET operation and a subsequent CB_LAYOUTRECALL.  A client
MUST NOT process a CB_LAYOUTRECALL that identifies an outstanding
LAYOUTGET operation to which the client has not yet received a reply.
Intersecting LAYOUTGET operations are identified in the CB_SEQUENCE
preceding the CB_LAYOUTRECALL.
The callback races section (Section 2.10.5.3) describes the sessions
mechanism for allowing the client to detect such situations in order
to delay processing such a CB_LAYOUTRECALL.  The server MUST
reference all conflicting LAYOUTGET operations in the CB_SEQUENCE
that precedes the CB_LAYOUTRECALL.  A zero-length array of referenced
operations is used by the server to tell the client that the server
does not know of any LAYOUTGET operations that conflict with the
recall.
12.5.4.2.2.1.  Client Considerations
Consider a pNFS client that has issued a LAYOUTGET and then receives
an overlapping CB_LAYOUTRECALL for the same file.  There are two
possibilities, which the client would be unable to distinguish
without additional information provided by the sessions
implementation.
1.  The server processed the LAYOUTGET before issuing the recall, so
    the LAYOUTGET response is in flight, and must be waited for
    because it may be carrying layout info that will need to be
    returned to deal with the CB_LAYOUTRECALL.

2.  The server issued the callback before receiving the LAYOUTGET.
    The server will not respond to the LAYOUTGET until the
    CB_LAYOUTRECALL is processed.
These possibilities could cause deadlock, as the client must wait for
the LAYOUTGET response before processing the recall in the first
case, but that response will not arrive until after the recall is
processed in the second case.  Via the CB_SEQUENCE operation, the
server provides the client with the { sessionid, slotid, sequenceid }
of any earlier LAYOUTGET operations which remain unconfirmed at the
server by the session slot usage rules.  This allows the client to
disambiguate between the two cases.  In case 1 the server will
provide the operation reference(s), whereas in case 2 it will not,
because there are no dependent client operations.  Therefore, the
action at the client will only require waiting in the case that the
client has not yet seen the server's earlier responses to the
LAYOUTGET operation(s).
This deadlock is avoided by adhering to the following requirements
(see also the sketch after this list):
o  A LAYOUTGET MUST be rejected with the error NFS4ERR_RECALLCONFLICT
   if there is an overlapping outstanding CB_LAYOUTRECALL to the same
   client.

o  When processing a recall, the client MUST wait for a response to
   all conflicting outstanding LAYOUTGETs that are referenced in the
   CB_SEQUENCE for the recall before performing any LAYOUTRETURN that
   could be affected by any such response.

o  The client SHOULD wait for responses to all operations required to
   complete a recall before sending any LAYOUTGETs that would
   conflict with the recall, because the server is likely to return
   errors for them (see the first item above).

o  Before sending a new LAYOUTGET for a range covered by a layout
   recall, the client SHOULD wait for responses to any outstanding
   LAYOUTGET that overlaps any portion of the new LAYOUTGET's range.
   This is because it is possible (although unlikely) that the prior
   operation may have arrived at the server after the recall
   completed and hence will succeed.

o  The recall process can be considered completed by the client when
   the final LAYOUTRETURN operation for the recalled range is
   completed.
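
The sketch below illustrates the disambiguation step: the client
walks the referenced operations supplied by CB_SEQUENCE and waits
only if some referenced LAYOUTGET reply has not yet been seen.  The
reply_seen() lookup into the client's slot table is hypothetical.

   /* Illustrative sketch.  referenced_op4 mirrors the
    * { sessionid, slotid, sequenceid } triples carried by
    * CB_SEQUENCE. */
   typedef struct {
           unsigned char sessionid[16];
           unsigned int  slotid;
           unsigned int  sequenceid;
   } referenced_op4;

   extern int reply_seen(const referenced_op4 *op);

   /* Returns nonzero if the recall must be delayed until the
    * referenced LAYOUTGET replies have arrived. */
   int recall_must_wait(const referenced_op4 *refs, unsigned int count)
   {
           unsigned int i;

           /* A zero-length array means the server knows of no
            * conflicting LAYOUTGETs: process the recall at once. */
           for (i = 0; i < count; i++)
                   if (!reply_seen(&refs[i]))
                           return 1;
           return 0;
   }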
12.5.4.2.2.2.  Server Considerations
Consider a related situation from the metadata server's point of
view.  The metadata server has issued a CB_LAYOUTRECALL and receives
an overlapping LAYOUTGET for the same file before the LAYOUTRETURN(s)
that respond to the CB_LAYOUTRECALL.  Again, there are two cases:

1.  The client issued the LAYOUTGET before processing the
    CB_LAYOUTRECALL.

2.  The client issued the LAYOUTGET after processing the
    CB_LAYOUTRECALL, but it arrived before the LAYOUTRETURN that
    completed that processing.

There is nothing the session sequence logic can do to disambiguate
between these two cases, because both operations are independent of
one another.  They are simply asynchronous events which crossed.  The
situation can even occur if the session is configured to use a single
connection for both operations and callbacks.

Given no method to disambiguate these cases, the metadata server MUST
reject the overlapping LAYOUTGET with the error
NFS4ERR_RECALLCONFLICT.  The client has two ways to avoid this
result.  It can issue the LAYOUTGET as a subsequent element of a
COMPOUND containing the LAYOUTRETURN that completes the
CB_LAYOUTRECALL, or it can wait for the response to that
LAYOUTRETURN.
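
On the server side, the rejection rule reduces to a simple check, as
in this non-normative sketch (the recall table, range test, and error
value follow this document's conventions but are hypothetical as
code):

   /* Illustrative metadata-server sketch: a LAYOUTGET that overlaps
    * an outstanding CB_LAYOUTRECALL to the same client is rejected. */
   #define NFS4ERR_RECALLCONFLICT 10061

   struct range { unsigned long long offset, length; };

   extern int recall_outstanding_for(const char *clientid,
                                     const struct range *r);
   extern int do_layoutget(const char *clientid, const struct range *r);

   int server_layoutget(const char *clientid, const struct range *r)
   {
           if (recall_outstanding_for(clientid, r))
                   return NFS4ERR_RECALLCONFLICT;
           return do_layoutget(clientid, r);
   }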
12.5.5.  Metadata Server Write Propagation
Asynchronous writes written through the metadata server may be
propagated lazily to the storage devices.  For data written
asynchronously through the metadata server, a client performing a
read at the appropriate storage device is not guaranteed to see the
newly written data until a COMMIT occurs at the metadata server.
While the write is pending, reads to the storage device may give out
either the old data, the new data, or a mixture of new and old.  Upon
completion of a synchronous WRITE or COMMIT (for asynchronously
written data), the metadata server MUST ensure that storage devices
give out the new data and that the data has been written to stable
storage.  If the server implements its storage in any way such that
it cannot obey these constraints, then it MUST recall the layouts to
prevent reads being done that cannot be handled correctly.  Note that
the layouts MUST be recalled prior to the server responding to the
associated WRITE operations.
12.6.  pNFS Mechanics
This section describes the operations flow taken by a pNFS client to
a metadata server and storage device.
When a pNFS client encounters a new FSID, it issues a GETATTR to the
NFSv4.1 server for the fs_layout_type (Section 5.13.1) attribute.  If
the attribute returns at least one layout type, and the layout types
returned are among the set supported by the client, the client knows
that pNFS is a possibility for the file system.  If, from the server
that returned the new FSID, the client does not have a client ID that
came from an EXCHANGE_ID result that returned
EXCHGID4_FLAG_USE_PNFS_MDS, it must send an EXCHANGE_ID to the server
with the EXCHGID4_FLAG_USE_PNFS_MDS bit set.  If the server's
response does not have EXCHGID4_FLAG_USE_PNFS_MDS, then contrary to
what the fs_layout_type attribute said, the server does not support
pNFS, and the client will not be able to use pNFS to that server; in
this case, the server should return NFS4ERR_NOTSUPP in response to
any pNFS operation.
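
The probing flow above might look as follows; the RPC wrappers are
hypothetical, while the flag value mirrors
EXCHGID4_FLAG_USE_PNFS_MDS:

   /* Illustrative sketch of pNFS capability probing. */
   #define EXCHGID4_FLAG_USE_PNFS_MDS 0x00020000

   extern int fs_layout_types(const char *fsid, int *types, int max);
   extern int client_supports(int layout_type);
   extern unsigned int exchange_id_flags(const char *server);

   /* Returns 1 if pNFS may be used against this FSID. */
   int pnfs_usable(const char *server, const char *fsid)
   {
           int types[8];
           int i, n = fs_layout_types(fsid, types, 8); /* via GETATTR */

           for (i = 0; i < n; i++)
                   if (client_supports(types[i]))
                           break;
           if (n <= 0 || i == n)
                   return 0;   /* no mutually supported layout type */

           /* The EXCHANGE_ID result must confirm that the server is
            * a pNFS metadata server. */
           return (exchange_id_flags(server) &
                   EXCHGID4_FLAG_USE_PNFS_MDS) != 0;
   }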
Once the client has a client ID that supports pNFS, it creates a
session over the client ID, requesting that the session be
persistent.
If a file is to be created on a pNFS-enabled file system, the client
uses the OPEN operation as it would normally, specifying for create
type either GUARDED4 or UNCHECKED4.  Among the normal set of
attributes that may be provided upon OPEN used for creation, there is
an optional layout_hint attribute.  The client's use of layout_hint
allows the client to express its preference for a layout type and its
associated layout details.  The client is advised to combine a
GETATTR operation with the OPEN in the same COMPOUND.  The GETATTR
may then retrieve the layout_type attribute for the newly created
file.  The client will then know what layout type the server has
chosen for the file and therefore what storage protocol the client
must use.
If the client wants to open an existing file, then it also includes a
GETATTR to determine what layout type the file supports.
The GETATTR in either the file creation or plain file open case can
also include the layout_blksize and layout_alignment attributes so
that the client can determine optimal offsets and lengths for I/O on
the file.
Assuming the client supports the layout type returned by GETATTR and
it chooses to use pNFS for data access, it then issues LAYOUTGET
using the filehandle returned by OPEN, specifying the range on which
it wants to do I/O.  The response is a layout, which may be a subset
of the range for which the client asked.  It also includes device IDs
and a description of how data is organized (or in the case of
writing, how data is to be organized) across the devices.  The device
IDs and data description are encoded in a format that is specific to
the layout type, but that the client is expected to understand.
When the client wants to issue an I/O, it determines which device ID
it needs to send the I/O command to by examining the data description
in the layout.  It then issues a GETDEVICELIST to obtain a list of
all device ID to device address mappings, or a GETDEVICEINFO to find
the device address of a single device ID.  The client then sends the
I/O request to the device address, using the storage protocol defined
for the layout type.  Most importantly, these I/O requests may be
done in parallel.
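
A non-normative sketch of device resolution before I/O; the device
cache and RPC helpers are hypothetical, and the device ID size is
illustrative only:

   /* Illustrative sketch: resolve the device ID named by the layout
    * before issuing I/O. */
   typedef unsigned char deviceid4[16];   /* size illustrative */

   struct device_addr;   /* layout type-specific address */

   extern struct device_addr *device_cache_lookup(const deviceid4 id);
   extern struct device_addr *getdeviceinfo(const deviceid4 id);
   extern void issue_storage_io(struct device_addr *da,
                                unsigned long long off,
                                unsigned long long len);

   void pnfs_io(const deviceid4 id,
                unsigned long long off, unsigned long long len)
   {
           struct device_addr *da = device_cache_lookup(id);

           if (da == NULL)
                   da = getdeviceinfo(id); /* or a prior GETDEVICELIST */

           /* I/O requests to distinct devices may proceed in
            * parallel. */
           issue_storage_io(da, off, len);
   }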
If the I/O was a READ, then at some point the client may want to
commit the access time to the metadata server.  It uses the
LAYOUTCOMMIT operation.  If the I/O was a WRITE, then at some point
the client may want to commit to the metadata server the modification
time and the new size of the file (if it believes it extended the
file size), and to commit the modified data to the file system.
Again, it uses LAYOUTCOMMIT.
12.7.  Recovery
Recovery is complicated by the distributed nature of the pNFS
protocol.  In general, crash recovery for layouts is similar to crash
recovery for delegations in the base NFSv4.1 protocol.  However, the
client's ability to perform I/O without contacting the metadata
server, and the fact that, unlike delegations, layouts are not bound
to stateids, introduce subtleties that must be handled correctly if
the possibility of file system corruption is to be avoided.
12.7.1.  Client Recovery
Client recovery for layouts is similar to client recovery for other
lock and delegation state.  When a pNFS client restarts, it will lose
all information about the layouts that it previously owned.  There
are two methods by which the server can reclaim these resources and
allow otherwise conflicting layouts to be provided to other clients.
The first is through the expiry of the client's lease.  If the client
recovery time is longer than the lease period, the client's lease
will expire and the server will know that state may be released.  For
layouts, the server may release the state immediately upon lease
expiry, or it may allow the layout to persist awaiting possible lease
revival, as long as no other layout conflicts.
On the other hand, the client may restart in less time than it takes
for the lease period to expire.  In such a case, the client will
contact the server through the standard EXCHANGE_ID protocol.  The
server will find that the client's co_ownerid matches the co_ownerid
of the previous client invocation, but that the verifier is
different.  The server uses this as a signal to release all layout
state associated with the client's previous invocation.  In this
scenario, the data written by the client but not covered by a
successful LAYOUTCOMMIT is in an undefined state; it may have been
written or it may now be lost.  This is acceptable behavior, and it
is the client's responsibility to use LAYOUTCOMMIT to achieve the
desired level of stability.
12.7.2.  Dealing with Lease Expiration on the Client
The mappings between device IDs and device addresses are what enable
a pNFS client to safely write data to and read data from a storage
device.  These mappings are leased (in a manner similar to locking
state) from the metadata server, and as long as the lease is valid,
the client has the ability to issue I/O to the storage devices.  The
lease on device ID to device address mappings is renewed when the
metadata server receives a SEQUENCE operation from the pNFS client.
The same is not specified to be true for a data server receiving a
SEQUENCE operation, and the client MUST NOT assume that a SEQUENCE
sent to a data server will renew its lease; this applies to files-
based layouts only.
The expiration of the lease leads to the loss of the device ID to
device address mappings.  In the event of lease expiration, the
server may choose to fence a client's access to the storage devices,
thus preventing the use of expired device ID to device address
mappings.  To avoid lease expiration, the client should start its
lease timer based on the time that it issued the operation to the
metadata server rather than based on the time the response was
received.  It is also necessary to take propagation delay into
account as described in Section 8.12.  Thus, the client MUST be aware
of the one-way propagation delay and should issue renewals well in
advance of lease expiration.
If a client believes its lease has expired, it MUST NOT issue I/O to
the storage device until it has validated its lease.  The client can
issue a SEQUENCE operation to the metadata server.  If the SEQUENCE
operation is successful, but sr_status_flag has
SEQ4_STATUS_EXPIRED_ALL_STATE_REVOKED,
SEQ4_STATUS_EXPIRED_SOME_STATE_REVOKED, or
SEQ4_STATUS_ADMIN_STATE_REVOKED set, the client MUST NOT use
currently held layouts or device ID to device address mappings.  The
client has two choices for recovering from the lease expiration.
First, for all modified but uncommitted data, it can write that data
to the metadata server, either using WRITE with the stable argument
set to FILE_SYNC4 or using WRITE followed by COMMIT.  Second, the
client can reestablish a client ID and session with the server,
obtain new layouts and device ID to device address mappings for the
modified data ranges, and then write the data to the storage devices
with the newly obtained layouts.
If sr_status_flags from the metadata server has
SEQ4_STATUS_RESTART_RECLAIM_NEEDED set (or SEQUENCE returns
NFS4ERR_STALE_CLIENTID, or SEQUENCE returns NFS4ERR_BAD_SESSION and
CREATE_SESSION returns NFS4ERR_STALE_CLIENTID), then the metadata
server has restarted, and the client SHOULD recover using the methods
described in Section 12.7.4.
If sr_status_flags from the metadata server has
SEQ4_STATUS_LEASE_MOVED set, then the client recovers by following
the procedure described in Section 10.6.7.1.  After that, the client
may get an indication that the layout state was not moved with the
file system.  The client then recovers as in the other applicable
situations discussed in Paragraph 3 or Paragraph 4 of this section.
If sr_status_flags reports no loss of state, then the lease for the
mappings the client has with the metadata server is valid and
renewed, and the client can once again issue I/O requests to the
storage devices.
While clients SHOULD NOT issue I/Os to storage devices that may
extend past the lease expiration time period, this is not always
possible; for example, an extended network partition may start after
the I/O is sent and not heal until after the I/O request is received
by the storage device.  Thus, the metadata server and/or storage
devices are responsible for protecting themselves from I/Os that are
sent before the lease expires but arrive after the lease expires.
See Section 12.7.3.
12.7.3.  Dealing with Loss of Layout State on the Metadata Server
This is a description of the case where all of the following are
true:

o  the metadata server has not restarted

o  a pNFS client's device ID to device address mappings and/or
   layouts have been discarded (usually because the client's lease
   expired) and are invalid

o  an I/O from the pNFS client arrives at the storage device

The metadata server and its storage devices MUST solve this by
fencing the client; in other words, by preventing the execution of
I/O operations from the client to the storage devices after layout
state loss.  The details of how fencing is done are specific to the
layout type.  The solution for NFSv4.1 file-based layouts is
described in Section 13.12, and solutions for other layout types are
in their respective external specification documents.
12.7.4.  Recovery from Metadata Server Restart
The pNFS client will discover that the metadata server has restarted
(e.g., rebooted) via the methods described in Section 8.6.2 and
discussed in a pNFS-specific context in Paragraph 4 of
Section 12.7.2.  The client MUST stop using layouts and delete the
device ID to device address mappings it previously received from the
metadata server.  Having done that, if the client wrote data to the
storage device without committing the layouts via LAYOUTCOMMIT, then
the client has additional work to do in order to have the client,
metadata server, and storage device(s) all synchronized on the state
of the data.
o  If the client has data still modified and unwritten in the
   client's memory, the client has only two choices.

   1.  The client can obtain a layout via LAYOUTGET after the
       server's grace period and write the data to the storage
       devices.

   2.  The client can write that data through the metadata server
       using the WRITE (Section 17.32) operation, and then obtain
       layouts as desired.

   As noted in Paragraph 3 of Section 8.6.2.1 and in
   Section 17.43.4, LAYOUTGET and WRITE may not be allowed until the
   grace period expires.  Under some conditions, as described in
   Section 12.7.5, LAYOUTGET and/or WRITE may be permitted during the
   metadata server's grace period.
o  If the client asynchronously wrote data to the storage device, but
   still has a copy of the data in its memory, then it has available
   to it the recovery options listed above in the previous bullet
   point.  If the metadata server is also in its grace period, the
   client has available to it the options below in the next bullet
   item.
o  The client does not have a copy of the data in its memory and the
   metadata server is still in its grace period.  The client cannot
   use LAYOUTGET (within or outside the grace period) to reclaim a
   layout because the contents of the response from LAYOUTGET may not
   match what it had previously.  The range might be different, or it
   might get the same range but the content of the layout might be
   different.  Even if the content of the layout appears to be the
   same, the device IDs may map to different device addresses, and
   even if the device addresses are the same, the device addresses
   could have been assigned to a different storage device.  The
   option of retrieving the data from the storage device and writing
   it to the metadata server per the recovery scenario described
   above in the previous two bullets is not available because, again,
   the mappings of range to device ID, device ID to device address,
   and device address to physical device are stale, and new mappings
   obtained via a new LAYOUTGET do not solve the problem.
   The only recovery option for this scenario is to issue a
   LAYOUTCOMMIT in reclaim mode, which the metadata server will
   accept as long as it is in its grace period.  The use of
   LAYOUTCOMMIT in reclaim mode informs the metadata server that the
   layout has changed.  It is critical that the metadata server
   receive this information before its grace period ends, and thus
   before it starts allowing updates to the file system.
   To issue LAYOUTCOMMIT in reclaim mode, the client sets the
   loca_reclaim field of the operation's arguments (Section 17.42.2)
   to TRUE.  During the metadata server's recovery grace period (and
   only during the recovery grace period) the metadata server is
   prepared to accept LAYOUTCOMMIT requests with the loca_reclaim
   field set to TRUE.
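
   A non-normative sketch of issuing a reclaim-mode LAYOUTCOMMIT; the
   argument structure abbreviates LAYOUTCOMMIT4args and the RPC
   wrapper is hypothetical:

      /* Illustrative sketch of a reclaim-mode LAYOUTCOMMIT. */
      typedef int bool_t;

      struct layoutcommit_args {
              unsigned long long loca_offset;
              unsigned long long loca_length;
              bool_t             loca_reclaim;
              /* loca_layoutupdate etc. omitted for brevity */
      };

      extern int send_layoutcommit(const struct layoutcommit_args *a);

      int reclaim_layoutcommit(unsigned long long off,
                               unsigned long long len)
      {
              struct layoutcommit_args a = {
                      .loca_offset  = off,
                      .loca_length  = len,
                      .loca_reclaim = 1, /* TRUE: honored only during
                                          * the server's recovery
                                          * grace period */
              };
              return send_layoutcommit(&a);
      }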
   When loca_reclaim is TRUE, the client is attempting to commit
   changes to the layout that occurred prior to the restart of the
   metadata server.  The metadata server applies some consistency
   checks on the loca_layoutupdate field of the arguments to
   determine whether the client can commit the data written to the
   storage device to the file system.  The loca_layoutupdate field is
   of data type layoutupdate4 and contains layout type-specific
   content (in the lou_body field of loca_layoutupdate).  The layout
   type-specific information that loca_layoutupdate might have is
   discussed in Section 12.5.3.3.  If the metadata server's
   consistency checks on loca_layoutupdate succeed, then the metadata
   server MUST commit the data (as described by the loca_offset,
   loca_length, and loca_layoutupdate fields of the arguments) that
   was written to the storage device.  If the metadata server's
   consistency checks on loca_layoutupdate fail, the metadata server
   rejects the LAYOUTCOMMIT operation and makes no changes to the
   file system.  However, any time LAYOUTCOMMIT with loca_reclaim
   TRUE fails, the pNFS client has lost all the data in the range
   defined by <loca_offset, loca_length>.  A client can defend
   against this risk by caching all data, whether written
   synchronously or asynchronously, in its memory, and by not
   releasing the cached data until a successful LAYOUTCOMMIT.  This
   condition does not hold true for all layout types; for example,
   files-based storage devices need not suffer from this limitation.
o  The client does not have a copy of the data in its memory and the
   metadata server is no longer in its grace period; i.e., the
   metadata server returns NFS4ERR_NO_GRACE.  As with the scenario in
   the above bullet item, the failure of LAYOUTCOMMIT means the data
   in the range <loca_offset, loca_length> is lost.  The defense
   against the risk is the same: cache all written data on the client
   until a successful LAYOUTCOMMIT.
12.7.5.  Operations During Metadata Server Grace Period
Some of the recovery scenarios thus far noted that some operations,
namely WRITE and LAYOUTGET, might be permitted during the metadata
server's grace period.  The metadata server may allow these
operations during its grace period.  For LAYOUTGET, the metadata
server must reliably determine that servicing such a request will not
conflict with an impending LAYOUTCOMMIT reclaim request.  For WRITE,
it must reliably determine that the request will not conflict with an
impending OPEN, or with a LOCK where the file has mandatory file
locking enabled.

As mentioned previously, some operations, namely WRITE and LAYOUTGET,
may be rejected during the metadata server's grace period, because to
provide simple, valid handling during the grace period, the easiest
method is to simply reject all non-reclaim pNFS requests and WRITE
operations by returning the NFS4ERR_GRACE error.  However, depending
on the storage protocol (which is specific to the layout type) and
metadata server implementation, the metadata server may be able to
determine that a particular request is safe.  For example, a metadata
server may save provisional allocation mappings for each file to
stable storage, as well as information about potentially conflicting
OPEN share modes and mandatory record locks that might have been in
effect at the time of restart, and use this information during the
recovery grace period to determine that a WRITE request is safe.
12.7.6.  Storage Device Recovery
Recovery from storage device restart is mostly dependent upon the
layout type in use.  However, there are a few general techniques a
client can use if it discovers a storage device has crashed while
holding modified, uncommitted data that was asynchronously written.
First and foremost, it is important to realize that the client is the
only one that has the information necessary to recover non-committed
data, since it holds the modified data and most probably nobody else
does.  Second, the best solution is for the client to err on the side
of caution and attempt to re-write the modified data through another
path.

The client SHOULD immediately write the data to the metadata server,
with the stable field in the WRITE4args set to FILE_SYNC4.  Once it
does this, there is no need to wait for the original storage device.
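As a non-normative illustration of this recovery step, a client
implementation might replay its cached dirty ranges through the
metadata server roughly as follows.  The names struct mds_session and
nfs_write() are hypothetical client-internal helpers, not part of
this protocol; only the stable_how4 values come from the NFSv4.1 XDR:

   #include <stdint.h>

   enum stable_how4 { UNSTABLE4 = 0, DATA_SYNC4 = 1, FILE_SYNC4 = 2 };

   struct mds_session;                              /* hypothetical */
   int nfs_write(struct mds_session *, const void *buf,
                 uint64_t off, uint32_t len,
                 enum stable_how4 stable);          /* hypothetical */

   /* Re-write data cached for a failed storage device through the
    * metadata server.  Because stable is FILE_SYNC4, a successful
    * reply means the data is on stable storage, and the client
    * need not wait for the original storage device to recover. */
   static int
   rewrite_through_mds(struct mds_session *mds, const void *buf,
                       uint64_t off, uint32_t len)
   {
           return nfs_write(mds, buf, off, len, FILE_SYNC4);
   }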
12.8.  Metadata and Storage Device Roles
If the same physical hardware is used to implement both a metadata
server and storage device, then the same hardware entity is to be
understood to be implementing two distinct roles, and it is important
that it be clearly understood on behalf of which role the hardware is
executing at any given time.

Various sub-cases can be distinguished.
1.  The storage device uses NFSv4.1 as the storage protocol.  The
    same physical hardware is used to implement both a metadata and
    data server.  If an EXCHANGE_ID operation issued to the metadata
    server has EXCHGID4_FLAG_USE_PNFS_MDS set and
    EXCHGID4_FLAG_USE_PNFS_DS not set, the role of all sessions
    derived from the client ID is metadata server-only.  If an
    EXCHANGE_ID operation issued to the data server has
    EXCHGID4_FLAG_USE_PNFS_DS set and EXCHGID4_FLAG_USE_PNFS_MDS not
    set, the role of all sessions derived from the client ID is data
    server-only.  These assertions are true regardless of whether the
    network addresses of the metadata server and data server are the
    same or not.

    The client will use the same client owner for both the metadata
skipping to change at page 261, line 39
    operations.
    Note that it may be the case that while the metadata server and
    the storage device are distinct from one client's point of view,
    the roles may be reversed according to another client's point of
    view.  For example, in the cluster file system model, a metadata
    server to one client may be a data server to another client.  If
    NFSv4.1 is being used as the storage protocol, then pNFS servers
    need to mark filehandles according to their specific roles.
    If a current filehandle is set that is inconsistent with the role
    to which it is directed, then the error NFS4ERR_BADHANDLE should
    result.  For example, if a request is directed at the data
    server because the first current filehandle is from a layout, any
    attempt to set the current filehandle to a value not from a
    layout should be rejected.  Similarly, if the first current
    filehandle was a value not from a layout, a subsequent attempt to
    set the current filehandle to a value obtained from a layout
    should be rejected.
3.  The storage device does not use NFSv4.1 as the storage protocol,
    and the same physical hardware is used to implement both a
    metadata and storage device.  Whether distinct network addresses
    are used to access the metadata server and storage device is
    immaterial, because it is always clear to the pNFS client and
    server, from the upper layer protocol being used (NFSv4.1 or
    non-NFSv4.1), to what role the request to the common server
    network address is directed.
12.9.  Security Considerations
pNFS separates and provides access to file system metadata and data.
There are pNFS-specific operations (listed in Section 12.3) that
provide access to the metadata; all existing NFSv4.1 conventional
(non-pNFS) security mechanisms and features apply to accessing the
metadata.  The combination of components in a pNFS system (see
Figure 65) is required to preserve the security properties of NFSv4.1
with respect to an entity accessing a storage device from a client,
including security countermeasures to defend against threats for
which NFSv4.1 provides defenses in environments where these threats
are considered significant.
In some cases, the security countermeasures for connections to
storage devices may take the form of physical isolation or a
recommendation not to use pNFS in an environment.  For example, it
may be impractical to provide confidentiality protection for some
storage protocols to protect against eavesdropping; in environments
where eavesdropping on such protocols is of sufficient concern to
require countermeasures, physical isolation of the communication
channel (e.g., via direct connection from client(s) to storage
device(s)) and/or a decision to forego use of pNFS (e.g., and fall
back to conventional NFSv4.1) may be appropriate courses of action.
Where communication with storage devices is subject to the same
threats as client-to-metadata-server communication, the protocols
used for that communication need to provide security mechanisms no
weaker than those available via RPCSEC_GSS for NFSv4.1.
pNFS implementations MUST NOT remove NFSv4.1's access controls.  The
combination of clients, storage devices, and the metadata server is
responsible for ensuring that all client-to-storage-device file data
access respects NFSv4.1's ACLs and file open modes.  This entails
performing both of these checks on every access in the client, the
storage device, or both (as applicable; when the storage device is an
NFSv4.1 server, the storage device is ultimately responsible for
controlling access).  If a pNFS configuration performs these checks
only in the client, the risk of a misbehaving client obtaining
unauthorized access is an important consideration in determining when
it is appropriate to use such a pNFS configuration.  Such layout
types SHOULD NOT be used when client-only access checks do not
provide sufficient assurance that NFSv4.1 access control is being
applied correctly.
13.  PNFS: NFSv4.1 File Layout Type
This section describes the semantics and format of NFSv4.1 file-based
layouts for pNFS.  NFSv4.1 file-based layouts use the
LAYOUT4_NFSV4_1_FILES layout type.  The LAYOUT4_NFSV4_1_FILES type
defines the striping of data across multiple NFSv4.1 data servers.
13.1.  Client ID and Session Considerations
Sessions are a mandatory feature of NFSv4.1, and this extends to both
the metadata server and file-based (NFSv4.1-based) data servers.
The role a server plays in pNFS is determined by the result it
returns from EXCHANGE_ID.  The roles are:
o  metadata server (EXCHGID4_FLAG_USE_PNFS_MDS is set in the result
   eir_flags),

o  data server (EXCHGID4_FLAG_USE_PNFS_DS is set),

o  non-metadata server (EXCHGID4_FLAG_USE_NON_PNFS is set).  This is
   an NFSv4.1 server that does not support operations (e.g.,
   LAYOUTGET) or attributes that pertain to pNFS.
The client MAY request zero or more of EXCHGID4_FLAG_USE_NON_PNFS,
EXCHGID4_FLAG_USE_PNFS_DS, or EXCHGID4_FLAG_USE_PNFS_MDS, even though
some combinations (e.g., EXCHGID4_FLAG_USE_NON_PNFS |
EXCHGID4_FLAG_USE_PNFS_MDS) are contradictory.  The server, however,
MUST return only the following acceptable combinations:
          +--------------------------------------------------------+
          | Acceptable Results from EXCHANGE_ID                    |
          +--------------------------------------------------------+
          | EXCHGID4_FLAG_USE_PNFS_MDS                             |
          | EXCHGID4_FLAG_USE_PNFS_MDS | EXCHGID4_FLAG_USE_PNFS_DS |
          | EXCHGID4_FLAG_USE_PNFS_DS                              |
          | EXCHGID4_FLAG_USE_NON_PNFS                             |
          | EXCHGID4_FLAG_USE_PNFS_DS | EXCHGID4_FLAG_USE_NON_PNFS |
          +--------------------------------------------------------+
As the above table implies, a server can have one or two roles.  A
server can be both a metadata server and a data server, or it can be
both a data server and a non-metadata server.  In addition to
returning two roles in EXCHANGE_ID's results, and thus serving both
roles via a common client ID, a server can serve two roles by
returning a unique client ID and server owner for each role in each
of two EXCHANGE_ID results, with each result indicating one role.
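As a non-normative illustration, a client might validate the role
flags a server returns in eir_flags against the table above as
follows.  The EXCHGID4_FLAG_* constants are those defined by this
protocol's XDR; the function name is illustrative only:

   #include <stdbool.h>
   #include <stdint.h>

   /* EXCHGID4_FLAG_* constants as defined in the NFSv4.1 XDR. */

   static bool
   acceptable_pnfs_roles(uint32_t eir_flags)
   {
           uint32_t roles = eir_flags & (EXCHGID4_FLAG_USE_NON_PNFS |
                                         EXCHGID4_FLAG_USE_PNFS_MDS |
                                         EXCHGID4_FLAG_USE_PNFS_DS);

           switch (roles) {
           case EXCHGID4_FLAG_USE_PNFS_MDS:
           case EXCHGID4_FLAG_USE_PNFS_MDS | EXCHGID4_FLAG_USE_PNFS_DS:
           case EXCHGID4_FLAG_USE_PNFS_DS:
           case EXCHGID4_FLAG_USE_NON_PNFS:
           case EXCHGID4_FLAG_USE_PNFS_DS | EXCHGID4_FLAG_USE_NON_PNFS:
                   return true;    /* one of the five rows above */
           default:
                   return false;   /* empty or contradictory */
           }
   }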
If a pNFS metadata client gets a layout that refers it to an NFSv4.1
data server, it needs a client ID on that data server.  If it does
not yet have a client ID from the server that had the
EXCHGID4_FLAG_USE_PNFS_DS flag set in the EXCHANGE_ID results, then
the client must send an EXCHANGE_ID to the data server, using the
same co_ownerid as it sent to the metadata server, with the
EXCHGID4_FLAG_USE_PNFS_DS flag set in the arguments.  If the server's
EXCHANGE_ID results have EXCHGID4_FLAG_USE_PNFS_DS set, then the
client may use the client ID to create sessions that will exchange
pNFS data operations.  The client ID returned by the data server has
no relationship with the client ID returned by a metadata server
unless the client IDs are equal and the server owners and server
scopes of the data server and metadata server are equal.
In NFSv4.1, the sessionid in the SEQUENCE operation implies the
client ID, which in turn might be used by the server to map the
stateid to the right client/server pair.  However, when a data server
is presented with a READ or WRITE operation with a stateid, because
the stateid is associated with a client ID on the metadata server,
and because the sessionid in the preceding SEQUENCE operation is tied
to the client ID of the data server, the data server has no obvious
way to determine the metadata server from the COMPOUND procedure, and
thus has no way to validate the stateid.  One recommended approach is
for pNFS servers to encode metadata server routing and/or identity
information in the data server filehandles as returned in the layout.
If metadata server routing and/or identity information is encoded in
data server filehandles, then when the metadata server identity or
location changes, the data server filehandles it gave out must become
invalid (stale), and so the metadata server must first recall the
layouts.  Invalidating a data server filehandle does not render the
NFS client's data cache invalid.  The client's cache should map a
data server filehandle to a metadata server filehandle, and a
metadata server filehandle to cached data.
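A non-normative sketch of the cache association just described, using
hypothetical types: cached data is keyed by the metadata server
filehandle, and each data server filehandle from a layout points back
to that filehandle, so invalidating data server filehandles leaves
the cached data reachable:

   struct nfs_fh {                      /* hypothetical fh holder */
           unsigned int  len;
           unsigned char data[128];     /* NFS4_FHSIZE */
   };

   struct cached_file {
           struct nfs_fh mds_fh;        /* metadata server filehandle */
           /* ... cached data pages, indexed under mds_fh ... */
   };

   struct ds_fh_mapping {               /* one per layout filehandle */
           struct nfs_fh       ds_fh;   /* data server filehandle */
           struct cached_file *file;    /* ds_fh -> mds_fh -> data */
   };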
13.2.  File Layout Definitions
The following definitions apply to the LAYOUT4_NFSV4_1_FILES layout
type, and may be applicable to other layout types.
Unit.  A unit is a fixed-size quantity of data written to a data
   server.

Pattern.  A pattern is a method of distributing one or more equal-
   sized units across a set of data servers.  A pattern is iterated
   one or more times.

Stripe.  A stripe is a set of data distributed across a set of data
   servers in a pattern before that pattern repeats.

Stripe Count.  A stripe count is the number of stripe units in a
   pattern.

Stripe Width.  A stripe width is the size of a stripe in octets:
   the stripe width = the stripe count * the size of the stripe
   unit.  (A worked example follows the next two paragraphs.)

Hereafter, this document will refer to a unit that is written in a
pattern as a "stripe unit".

A pattern may have more stripe units than data servers.  If so, some
data servers will have more than one stripe unit per stripe.  A data
server that has multiple stripe units per stripe MAY store each unit
in a different data file (and depending on the implementation, will
possibly assign a unique data filehandle to each data file).
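For instance, using the values that appear in the examples later in
this section (a stripe unit size of 4096 octets and a stripe count of
three):

   stripe_width = stripe_count * stripe_unit_size
                = 3 * 4096
                = 12288 octets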
13.3.  File Layout Data Types
The high-level NFSv4.1 layout data types are nfsv4_1_file_layouthint4,
nfsv4_1_file_layout_ds_addr4, and nfsv4_1_file_layout4.
The SETATTR operation supports a layout hint attribute
(Section 5.13.4).  When the client sets a layout hint (data type
layouthint4) with a layout type of LAYOUT4_NFSV4_1_FILES (the
loh_type field), the loh_body field contains a value of data type
nfsv4_1_file_layouthint4.
   const NFL4_UFLG_MASK                  = 0x0000003F;
   const NFL4_UFLG_DENSE                 = 0x00000001;
   const NFL4_UFLG_COMMIT_THRU_MDS       = 0x00000002;
   const NFL4_UFLG_STRIPE_UNIT_SIZE_MASK = 0xFFFFFFC0;

   typedef uint32_t nfl_util4;

   const NFLH4_CARE_DENSE            = NFL4_UFLG_DENSE;
   const NFLH4_CARE_COMMIT_THRU_MDS  = NFL4_UFLG_COMMIT_THRU_MDS;
   const NFLH4_CARE_STRIPE_UNIT_SIZE = 0x00000040;
   const NFLH4_CARE_STRIPE_COUNT     = 0x00000080;

   /* Encoded in the loh_body field of type layouthint4: */
   struct nfsv4_1_file_layouthint4 {
           uint32_t        nflh_care;
           nfl_util4       nflh_util;
           count4          nflh_stripe_count;
   };
The generic layout hint structure is described in Section 3.2.21.
The client uses the layout hint in the layout_hint (Section 5.13.4)
attribute to indicate the preferred type of layout to be used for a
newly created file.  The LAYOUT4_NFSV4_1_FILES layout type-specific
content for the layout hint is composed of three fields.  The first
field, nflh_care, is a set of flags indicating which values of the
hint the client cares about.  If the NFLH4_CARE_DENSE flag is set,
then the client indicates in the second field, nflh_util, a
preference for how the data file is packed (Section 13.5), which is
controlled by the value of nflh_util & NFL4_UFLG_DENSE.  If the
NFLH4_CARE_COMMIT_THRU_MDS flag is set, then the client indicates a
preference for whether the client should send COMMIT operations to
the metadata server or data server (Section 13.8), which is
controlled by the value of nflh_util & NFL4_UFLG_COMMIT_THRU_MDS.  If
the NFLH4_CARE_STRIPE_UNIT_SIZE flag is set, the client indicates its
preferred stripe unit size, which is indicated by nflh_util &
NFL4_UFLG_STRIPE_UNIT_SIZE_MASK (thus the stripe unit size MUST be a
multiple of 64 octets).  If the NFLH4_CARE_STRIPE_COUNT flag is set,
the client indicates in the third field, nflh_stripe_count, the
stripe count.  The stripe count multiplied by the stripe unit size is
the stripe width.
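As a non-normative illustration, a client might construct an
nflh_util value for the layout hint as follows.  The NFL4_UFLG_*
constants are those defined above; make_nflh_util() is an
illustrative name.  Because the low six bits of nfl_util4 carry the
flags, any stripe unit size expressed this way is necessarily a
multiple of 64 octets:

   #include <stdbool.h>
   #include <stdint.h>

   typedef uint32_t nfl_util4;

   static nfl_util4
   make_nflh_util(uint32_t stripe_unit_size, bool dense,
                  bool commit_thru_mds)
   {
           /* Only the upper 26 bits carry the size hint. */
           nfl_util4 u = stripe_unit_size &
                         NFL4_UFLG_STRIPE_UNIT_SIZE_MASK;

           if (dense)
                   u |= NFL4_UFLG_DENSE;
           if (commit_thru_mds)
                   u |= NFL4_UFLG_COMMIT_THRU_MDS;
           return u;
   }

   /* The receiver recovers the stripe unit size with:
    *   stripe_unit_size = nflh_util &
    *                      NFL4_UFLG_STRIPE_UNIT_SIZE_MASK;
    */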
When LAYOUTGET returns a LAYOUT4_NFSV4_1_FILES layout (indicated in
the loc_type field of the lo_content field), the loc_body field of
the lo_content field contains a value of data type
nfsv4_1_file_layout4. Among other content, nfsv4_1_file_layout4 has
a storage device ID (field nfl_deviceid) of data type deviceid4. The
GETDEVICEINFO operation maps a device ID to a storage device address
(type device_addr4). When GETDEVICEINFO returns a device address
with a layout type of LAYOUT4_NFSV4_1_FILES (the da_layout_type
field), the da_addr_body field contains a value of data type
nfsv4_1_file_layout_ds_addr4.
   typedef netaddr4 multipath_list4<>;

   /* Encoded in the da_addr_body field of type device_addr4: */
   struct nfsv4_1_file_layout_ds_addr4 {
           uint32_t        nflda_stripe_indices<>;
           multipath_list4 nflda_multipath_ds_list<>;
   };
The nfsv4_1_file_layout_ds_addr4 data type represents the device
address.  It is composed of two fields:

1.  nflda_multipath_ds_list: An array of lists of data servers, where
    each list can have one or more elements, and each element
    represents an equivalent (see Section 13.6) data server.

2.  nflda_stripe_indices: An array of indices used to index into
    nflda_multipath_ds_list.  The number of elements in
    nflda_stripe_indices is always equal to the stripe count.
   /* Encoded in the loc_body field of type layout_content4: */
   struct nfsv4_1_file_layout4 {
           deviceid4       nfl_deviceid;
           nfl_util4       nfl_util;
           uint32_t        nfl_first_stripe_index;
           nfs_fh4         nfl_fh_list<>;
   };
The nfsv4_1_file_layout4 data type represents the layout.  It is
composed of four fields:

1.  nfl_deviceid: The device ID that maps to a value of type
    nfsv4_1_file_layout_ds_addr4.

2.  nfl_util: Like the nflh_util field of data type
    nfsv4_1_file_layouthint4, a compact representation of how the
    data on a file on each data server is packed, whether the client
    should send COMMIT operations to the metadata server or data
    server, and the stripe unit size.

3.  nfl_first_stripe_index: The index of the first element of the
    nflda_stripe_indices array to use.

4.  nfl_fh_list: An array of data server filehandles, one for each
    list of data servers in each element of the
    nflda_multipath_ds_list array.  The number of elements in
    nfl_fh_list MUST be one of three values:

    *  Zero.  This means that the filehandle used for each data
       server is the same as the filehandle returned by the OPEN
       operation from the metadata server.

    *  One.  This means that every data server uses the same
       filehandle: what is specified in nfl_fh_list[0].

    *  The same number of elements as nflda_multipath_ds_list.  When
       issuing an I/O to any data server in
       nflda_multipath_ds_list[X], the filehandle in nfl_fh_list[X]
       MUST be used.

The details on the interpretation of the layout are in Section 13.4.
13.4.  Interpreting the File Layout
The algorithm for determining the filehandle and the set of data
server network addresses to which stripe unit i (SUi) is written is:

   stripe_count = number of elements in nflda_stripe_indices;
   ds_count     = number of elements in nflda_multipath_ds_list;
   fh_count     = number of elements in nfl_fh_list;

   idx = nflda_stripe_indices[
       (SUi + nfl_first_stripe_index) % stripe_count];

   if (fh_count == ds_count) {
       /* Per specification, idx is within the bounds of
          nfl_fh_list. */
       fh = nfl_fh_list[idx];
   } else if (fh_count == 1) {
       fh = nfl_fh_list[0];
   } else {
       fh = filehandle returned by OPEN;
   }

   /* Per specification, idx is within the bounds of
      nflda_multipath_ds_list. */
   address_list = nflda_multipath_ds_list[idx];

The client would then select a data server from address_list, and
issue a READ or WRITE operation using the filehandle specified in fh.
Consider the following example:
Suppose we have a device address consisting of seven data servers,
arranged in three equivalence (Section 13.6) classes:

      { A, B, C, D }, { E }, { F, G }

where A through G are network addresses.

Then

      nflda_multipath_ds_list<> = { A, B, C, D }, { E }, { F, G }

i.e.,

      nflda_multipath_ds_list[0] = { A, B, C, D }

      nflda_multipath_ds_list[1] = { E }

      nflda_multipath_ds_list[2] = { F, G }

Suppose the striping index array is:

      nflda_stripe_indices<> = { 2, 0, 1, 0 }

Now suppose the client gets a layout that has a device ID that maps
to the above device address, with

      nfl_first_stripe_index = 2

and

      nfl_fh_list = { 0x36, 0x87, 0x67 }.

If the client wants to write to SU0, the set of valid { network
address, filehandle } combinations for SU0 is determined by:

      idx = nflda_stripe_indices[(0 + 2) % 4]
          = nflda_stripe_indices[2]
          = 1

so

      nflda_multipath_ds_list[1] = { E }

and

      nfl_fh_list[1] = 0x87

The client can thus write SU0 to { 0x87, { E } }.

The destinations of the first thirteen stripe units are:

                 +-----+------------+--------------+
                 | SUi | filehandle | data servers |
                 +-----+------------+--------------+
                 |  0  |    0x87    |      E       |
                 |  1  |    0x36    |   A,B,C,D    |
                 |  2  |    0x67    |     F,G      |
                 |  3  |    0x36    |   A,B,C,D    |
                 |  4  |    0x87    |      E       |
                 |  5  |    0x36    |   A,B,C,D    |
                 |  6  |    0x67    |     F,G      |
                 |  7  |    0x36    |   A,B,C,D    |
                 |  8  |    0x87    |      E       |
                 |  9  |    0x36    |   A,B,C,D    |
                 | 10  |    0x67    |     F,G      |
                 | 11  |    0x36    |   A,B,C,D    |
                 | 12  |    0x87    |      E       |
                 +-----+------------+--------------+
13.5.  Sparse and Dense Stripe Unit Packing
The flag NFL4_UFLG_DENSE of the nfl_util4 data type (field nflh_util
of the data type nfsv4_1_file_layouthint4 and field nfl_util of data
type nfsv4_1_file_layout4) specifies how the data is packed within
the data file on a data server.  It allows for two different data
packings: sparse and dense.  The packing type determines the
calculation that must be made to map the client-visible file offset
to the offset within the data file located on the data server.
If nfl_util & NFL4_UFLG_DENSE is zero, this means that sparse packing
is being used.  Hence the logical offsets of the file as viewed by a
client issuing READs and WRITEs directly to the metadata server are
the same offsets each data server uses when storing a stripe unit.
The effect then, for striping patterns consisting of at least two
stripe units, is for each data server file to be sparse or holey.
For example, suppose a pattern has three stripe units, the stripe
unit size is 4096 octets, and there are three data servers in the
pattern; then the file in data server 1 will have stripe units 0, 3,
6, 9, ... filled, data server 2's file will have stripe units 1, 4,
7, 10, ... filled, and data server 3's file will have stripe units 2,
5, 8, 11, ... filled.  The unfilled stripe units of each file will be
holes, hence the files in each data server are sparse.
If sparse packing is being used and a client attempts I/O to one of
the holes, then an error MUST be returned by the data server.  Using
the above example, if data server 3 received a READ or WRITE request
for stripe unit 4, the data server would return
NFS4ERR_PNFS_IO_HOLE.  Thus data servers need to understand the
striping pattern in order to support sparse packing.
If nfl_util & NFL4_UFLG_DENSE is one, this means that dense packing
is being used and the data server files have no holes.  Dense packing
might be selected because the data server does not (efficiently)
support holey files, or because the data server cannot recognize
read-ahead unless there are no holes.  If dense packing is indicated
in the layout, the data files must be packed.  Using the example
striping pattern and stripe unit size that was used for the sparse
packing example, the corresponding dense packing would have all
stripe units of all data files filled.  Logical stripe units 0, 3, 6,
... of the file would live on stripe units 0, 1, 2, ... of the file
of data server 1, logical stripe units 1, 4, 7, ... of the file would
live on stripe units 0, 1, 2, ... of the file of data server 2, and
logical stripe units 2, 5, 8, ... of the file would live on stripe
units 0, 1, 2, ... of the file of data server 3.

Since dense packing does not leave holes on the data servers, the
pNFS client is allowed to write to any offset of any data file of any
data server in the stripe.  Thus the data servers need not know the
file's striping pattern.
The calculation to determine the octet offset within the data file
for dense data server layouts is:

   stripe_width = stripe_unit_size * N;
       where N = number of elements in nflda_stripe_indices.

   data_file_offset = floor(file_offset / stripe_width)
                      * stripe_unit_size
                      + file_offset % stripe_unit_size
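A non-normative C rendering of this calculation, together with a
worked value, may make the mapping concrete
(dense_data_file_offset() is an illustrative name):

   #include <stdint.h>

   /* Non-normative rendering of the dense-packing offset
    * calculation above.  N is the number of elements in
    * nflda_stripe_indices. */
   static uint64_t
   dense_data_file_offset(uint64_t file_offset,
                          uint64_t stripe_unit_size, uint32_t N)
   {
           uint64_t stripe_width = stripe_unit_size * N;

           return (file_offset / stripe_width) * stripe_unit_size
                  + file_offset % stripe_unit_size;
   }

   /* Worked value: with stripe_unit_size = 4096 and N = 3 (the
    * example of this section), file_offset 13000 lies in SU3, the
    * second stripe unit held by data server 1, so the result is
    * 1 * 4096 + 712 = 4808.  Under sparse packing no translation
    * is performed: the data file offset equals file_offset. */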
13.6.  Data Server Multipathing
The NFSv4.1 file layout supports multipathing to "equivalent"
(defined later in this section) data servers.  Data server-level
multipathing is used for bandwidth scaling via trunking
(Section 2.10.4) and for higher availability in the case of a data
server failure.  Multipathing allows the client to switch to another
data server that is exporting the same data stripe unit, without
having to contact the metadata server for a new layout.
To support data server multipathing, each element of the
nflda_multipath_ds_list contains an array of one or more data server
network addresses.  This array (data type multipath_list4)
represents a list of data servers (each identified by a network
address).  Each data server in the list MUST be equivalent (as
defined in the next paragraph) to every other data server in the
list.

Two data servers are equivalent if an EXCHANGE_ID issued to each data
server indicates the equivalency as described in Section 2.10.4.
Section 2.10.4 also describes conditions under which a session
created on one data server might be usable on another data server.

Two equivalent data servers must also have sufficient connections to
the storage, such that writing to one data server is equivalent to
writing to another; this also applies to reading.  Also, if multiple
copies of the same data exist, reading from one must provide access
to all existing copies.  As such, it is unlikely that multipathing
will provide additional benefit in the case of an I/O error.
13.7.  Operations Issued to NFSv4.1 Data Servers
Clients MUST use the filehandle described within the layout when
accessing data on NFSv4.1 data servers.  When using the layout's
filehandle, the client MUST only issue the NULL procedure and the
COMPOUND procedure's BACKCHANNEL_CTL, BIND_CONN_TO_SESSION,
CREATE_SESSION, COMMIT, DESTROY_CLIENTID, DESTROY_SESSION,
EXCHANGE_ID, READ, WRITE, PUTFH, SECINFO_NO_NAME, SET_SSV, and
SEQUENCE operations to the NFSv4.1 data server associated with that
data server filehandle.  If a client issues an operation to the data
server other than those specified above, using the filehandle and
data server listed in the file's layout, that data server MUST return
NFS4ERR_NOTSUPP to the client, unless the server's EXCHANGE_ID
results returned (EXCHGID4_FLAG_USE_PNFS_DS |
EXCHGID4_FLAG_USE_PNFS_MDS) or (EXCHGID4_FLAG_USE_PNFS_DS |
EXCHGID4_FLAG_USE_NON_PNFS); see Section 13.1.  As described in
Section 12.5.1, a client MUST NOT issue an I/O to a data server for
which it does not hold a valid layout; the data server MUST reject
such an I/O.  [[Comment.11: We should discuss mixing of MDS and DS
operations in the same compound, but need to find other places where
this is discussed and move that discussion here.]]
GETATTR and SETATTR MUST be directed to the metadata server.  In the
case of a SETATTR of the size attribute, the control protocol is
responsible for propagating size updates/truncations to the data
servers.  In the case of extending WRITEs to the data servers, the
new size must be visible on the metadata server once a LAYOUTCOMMIT
has completed (see Section 12.5.3.2).  Section 13.11 describes the
mechanism by which the client is to handle data server files that do
not reflect the metadata server's size.
13.8.  COMMIT Through Metadata Server
The nfl_commit_through_mds field in the file layout (data type The flag NFL4_UFLG_COMMIT_THRU_MDS in the field nfl_util of the file
nfsv4_1_file_layout4) gives the metadata server the preferred way of layout (data type nfsv4_1_file_layout4) or the field nflh_util of the
layout hint (data type nfsv4_1_file_layouthint4) is an indication
from the metadata server to the client of the preferred way of
performing COMMIT. If this field is TRUE, the client SHOULD send performing COMMIT. If this flag is set, the client SHOULD send
COMMIT to the metadata server instead of sending it to the same data COMMIT to the metadata server instead of sending it to the same data
server to which the associated WRITEs were sent. In order to server to which the associated WRITEs were sent. If nfl_util &
maintain the current NFSv4.1 commit and recovery model, all the data NFL4_UFLG_COMMIT_THRU_MDS is TRUE, then in order to maintain the
servers MUST return a common writeverf verifier in all WRITE current NFSv4.1 commit and recovery model, the data servers MUST
responses for a given file layout. The value of the writeverf return a common writeverf verifier in all WRITE responses for a given
verifier MUST be changed at the metadata server or any data server file layout, and the metadata server's COMMIT implementation MUST
that is referenced in the layout, whenever there is a server event return the same writeverf. The value of the writeverf verifier MUST
that can possibly lead to loss of uncommitted data. The scope of the be changed at the metadata server or any data server that is
verifier can be for a file or for the entire pNFS server. It might referenced in the layout, whenever there is a server event that can
be more difficult for the server to maintain the verifier at the file possibly lead to loss of uncommitted data. The scope of the verifier
level but the benefit is that only events that impact a given file can be for a file or for the entire pNFS server. It might be more
will require recovery action. difficult for the server to maintain the verifier at the file level
but the benefit is that only events that impact a given file will
require recovery action. [[Comment.12: Trond comments: The whole
section needs justification: there is no discussion anywhere of why a
server implementor might prefer to receive COMMIT requests through
the MDS. Why couldn't such a server simply perform an implicit
COMMIT on LAYOUTCOMMIT since the latter has persistence guarantees
anyway on the metadata? Marc Eshel to propose text.]]
Note that if the layout specifies dense packing, then the offset
used in a COMMIT to the MDS may differ from the offset used in a
COMMIT to the data server.
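
   For example, with the usual dense-packing arithmetic (stripe unit
   and stripe count taken from the layout), the translation can be
   sketched in C as below; the helper name is an assumption, not
   part of the protocol.

   #include <stdint.h>

   /* Illustrative sketch: map a file offset to the offset within a
    * densely packed data server component file.  su is the stripe
    * unit in octets, n the number of data servers in the stripe. */
   static uint64_t dense_ds_offset(uint64_t file_off, uint64_t su,
                                   uint32_t n)
   {
       uint64_t stripe_no = file_off / su;  /* which stripe unit  */
       uint64_t rel       = file_off % su;  /* offset within unit */
       return (stripe_no / n) * su + rel;
   }

   A COMMIT sent to the metadata server would use the file offset
   unchanged, which is why the two offsets can differ.
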
The single COMMIT to the metadata server will return a verifier and The single COMMIT to the metadata server will return a verifier and
the client should compare it to all the verifiers from the WRITEs and the client should compare it to all the verifiers from the WRITEs and
fail the COMMIT if there are any mismatched verifiers. If COMMIT to fail the COMMIT if there are any mismatched verifiers. If COMMIT to
the metadata server fails, the client should reissue WRITEs for all the metadata server fails, the client should reissue WRITEs for all
the modified data in the file. The client should treat modified data the modified data in the file. The client should treat modified data
with a mismatched verifier as a WRITE failure and try to recover by with a mismatched verifier as a WRITE failure and try to recover by
reissuing the WRITEs to the original data server or using another reissuing the WRITEs to the original data server or using another
path to that data if the layout has not been recalled. Another path to that data if the layout has not been recalled. Another
option for the client is to get a new layout or just to rewrite the option for the client is to get a new layout or just to rewrite the
data through the metadata server. If the flag nfl_commit_through_mds data through the metadata server. If nfl_util &
is FALSE, the client should not send COMMIT to the metadata server. NFL4_UFLG_COMMIT_THRU_MDS is FALSE, sending a COMMIT to the metadata
Although it is valid to send COMMIT to the metadata server it should server might have no effect. If nfl_util & NFL4_UFLG_COMMIT_THRU_MDS
be used only to commit data that was written through the metadata is FALSE, a COMMIT sent to the metadata server should be used only to
server. See Section 12.7.6 for recovery options. commit data that was written to the metadata server. See
Section 12.7.6 for recovery options.
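
   In outline, the client-side verifier check might look like the
   following C sketch; the types and list structure are assumptions
   for illustration, not protocol XDR.

   #include <stdbool.h>
   #include <string.h>

   #define VERF_SIZE 8   /* size of an NFSv4 write verifier */

   /* One record per WRITE reply awaiting COMMIT (illustrative). */
   struct pending_write {
       unsigned char         verf[VERF_SIZE];
       struct pending_write *next;
   };

   /* True when every saved WRITE verifier matches the verifier
    * returned by the COMMIT through the metadata server; on a
    * mismatch the affected WRITEs must be reissued. */
   static bool commit_verified(const unsigned char cv[VERF_SIZE],
                               const struct pending_write *w)
   {
       for (; w != NULL; w = w->next)
           if (memcmp(w->verf, cv, VERF_SIZE) != 0)
               return false;
       return true;
   }
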
13.9. Global Stateid Requirements
Note, there are no stateids embedded within the layout returned by
the metadata server to the pNFS client. The client uses a stateid
returned previously by the metadata server (including results from
OPEN -- a delegation stateid is acceptable as well as a non-
delegation stateid -- lock operations, WANT_DELEGATION, and also from
the CB_PUSH_DELEG callback operation) or a special stateid to perform
I/O on the data servers, as in regular NFSv4.1. Special stateid
usage for I/O is subject to the NFSv4.1 protocol specification. The
stateid used for I/O MUST have the same effect and be subject to the
same validation on data server as it would if the I/O was being
performed on the metadata server itself in the absence of pNFS. This
has the implication that stateids are globally valid on both the
metadata and data servers. This requires the metadata server to
propagate changes in lock and open state to the data servers, so that
the data servers can validate I/O accesses. This is discussed
further in Section 13.11. Depending on when stateids are propagated,
the existence of a valid stateid on the data server may act as proof
of a valid layout.
13.10. The Layout Iomode 13.9. The Layout Iomode
The layout iomode need not be used by the metadata server when The layout iomode need not be used by the metadata server when
servicing NFSv4.1 file-based layouts, although in some circumstances servicing NFSv4.1 file-based layouts, although in some circumstances
it may be useful to use. For example, if the server implementation it may be useful. For example, if the server implementation supports
supports reading from read-only replicas or mirrors, it would be reading from read-only replicas or mirrors, it would be useful for
useful for the server to return a layout enabling the client to do the server to return a layout enabling the client to do so. As such,
so. As such, the client SHOULD set the iomode based on its intent to the client SHOULD set the iomode based on its intent to read or write
read or write the data. The client may default to an iomode of the data. The client may default to an iomode of LAYOUTIOMODE4_RW.
LAYOUTIOMODE4_RW. The iomode need not be checked by the data servers The iomode need not be checked by the data servers when clients
when clients perform I/O. However, the data servers SHOULD still perform I/O. However, the data servers SHOULD still validate that the
validate that the client holds a valid layout and return an error if client holds a valid layout and return an error if the client does
the client does not. not.
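
   One plausible client policy is shown in the C sketch below; the
   enum values follow the NFSv4.1 XDR, but the policy itself is an
   assumption, not a protocol requirement.

   /* Layout iomode values as in the NFSv4.1 XDR. */
   enum layoutiomode4 {
       LAYOUTIOMODE4_READ = 1,
       LAYOUTIOMODE4_RW   = 2,
       LAYOUTIOMODE4_ANY  = 3
   };

   /* Choose the iomode for LAYOUTGET from the client's intent;
    * defaulting to LAYOUTIOMODE4_RW is permitted by the text. */
   static enum layoutiomode4 pick_iomode(int intends_to_write)
   {
       return intends_to_write ? LAYOUTIOMODE4_RW
                               : LAYOUTIOMODE4_READ;
   }
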
13.11. Data Server State Propagation 13.10. Metadata and Data Server State Coordination
13.10.1. Global Stateid Requirements
Note that there are no stateids embedded within the layout returned by
the metadata server to the pNFS client. When the client issues I/O
to a data server, the stateid used must be one previously returned by
the metadata server. Permitted stateids include an open stateid (the
stateid field of data type OPEN4resok as returned by OPEN), a
delegation stateid (the stateid field of data types
open_read_delegation4 and open_write_delegation4 as returned by OPEN
or WANT_DELEGATION, or as sent by CB_PUSH_DELEG), or a stateid
returned by the LOCK or LOCKU operations. The stateid used for I/O
MUST have the same effect and be subject to the same validation on a
data server as it would if the I/O was being performed on the
metadata server itself in the absence of pNFS. This has the
implication that stateids are globally valid on both the metadata and
data servers. This requires the metadata server to propagate changes
in lock and open state to the data servers, so that the data servers
can validate I/O accesses. This is discussed further in
Section 13.10.2. Depending on when stateids are propagated, the
existence of a valid stateid on the data server may act as proof of a
valid layout.
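
   The data server's obligation can be sketched as follows in C;
   the state-lookup helpers stand in for whatever the control
   protocol provides and are assumptions for illustration.

   /* Error values follow the NFSv4 error number assignments. */
   enum {
       NFS4_OK                = 0,
       NFS4ERR_BAD_STATEID    = 10025,
       NFS4ERR_PNFS_NO_LAYOUT = 10080
   };

   struct stateid4 { unsigned int seqid; unsigned char other[12]; };

   /* Assumed lookups against state propagated from the MDS. */
   extern int state_known(const struct stateid4 *sid);
   extern int layout_valid_for(const struct stateid4 *sid);

   /* Validate an I/O stateid exactly as the MDS would. */
   static int ds_validate_io(const struct stateid4 *sid)
   {
       if (!state_known(sid))
           return NFS4ERR_BAD_STATEID;
       if (!layout_valid_for(sid))
           return NFS4ERR_PNFS_NO_LAYOUT;
       return NFS4_OK;
   }
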
13.10.2. Data Server State Propagation
Since the metadata server, which handles lock and open-mode state Since the metadata server, which handles lock and open-mode state
changes, as well as ACLs, may not be co-located with the data servers changes, as well as ACLs, may not be co-located with the data servers
where I/O accesses are validated, as such, the server implementation where I/O accesses are validated, the server implementation MUST take
MUST take care of propagating changes of this state to the data care of propagating changes of this state to the data servers. Once
servers. Once the propagation to the data servers is complete, the the propagation to the data servers is complete, the full effect of
full effect of those changes must be in effect at the data servers. those changes MUST be in effect at the data servers. However, some
However, some state changes need not be propagated immediately, state changes need not be propagated immediately, although all
although all changes SHOULD be propagated promptly. These state changes SHOULD be propagated promptly. These state propagations have
propagations have an impact on the design of the control protocol, an impact on the design of the control protocol, even though the
even though the control protocol is outside of the scope of this control protocol is outside of the scope of this specification.
specification. Immediate propagation refers to the synchronous Immediate propagation refers to the synchronous propagation of state
propagation of state from the metadata server to the data server(s); from the metadata server to the data server(s); the propagation must
the propagation must be complete before returning to the client. be complete before returning to the client.
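
   Since the control protocol is unspecified, the distinction can
   only be illustrated; in the C sketch below, every name is an
   assumption.

   struct ds;   /* one data server, opaque here */

   /* Assumed control-protocol calls: a blocking push and a queued,
    * promptly flushed push. */
   extern int  ds_push_state(struct ds *d, const void *delta);
   extern void ds_queue_state(struct ds *d, const void *delta);

   /* Propagate a state change to all data servers.  When the
    * change requires immediate propagation, the MDS must not
    * reply to the client until every push has completed. */
   static int propagate(struct ds **list, int n, const void *delta,
                        int immediate)
   {
       for (int i = 0; i < n; i++) {
           if (immediate) {
               int rc = ds_push_state(list[i], delta);
               if (rc != 0)
                   return rc;          /* hold the client reply */
           } else {
               ds_queue_state(list[i], delta);  /* prompt, async */
           }
       }
       return 0;
   }
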
13.11.1. Lock State Propagation 13.10.2.1. Lock State Propagation
If the pNFS server supports mandatory locking, any mandatory locks on If the pNFS server supports mandatory locking, any mandatory locks on
a file MUST be made effective at the data servers before the request a file MUST be made effective at the data servers before the request
that establishes them returns to the caller. Thus, mandatory lock that establishes them returns to the caller. Thus, mandatory lock
state MUST be synchronously propagated to the data servers. On the state MUST be synchronously propagated to the data servers. On the
other hand, since advisory lock state is not used for checking I/O other hand, since advisory lock state is not used for checking I/O
accesses at the data servers, there is no semantic reason for accesses at the data servers, there is no semantic reason for
propagating advisory lock state to the data servers. However, since propagating advisory lock state to the data servers. However, since
all lock, unlock, open downgrade and upgrade operations MAY affect the "seqid" all lock, unlock, open downgrade and upgrade operations MAY affect the "seqid"
stored within the stateid (see Section 8.1.3.1), the stateid changes stored within the stateid (see Section 8.1.3.1), the stateid changes
skipping to change at page 274, line 42 skipping to change at page 275, line 32
Since updates to advisory locks neither confer nor remove privileges, Since updates to advisory locks neither confer nor remove privileges,
these changes need not be propagated immediately, and may not need to these changes need not be propagated immediately, and may not need to
be propagated promptly. The updates to advisory locks need only be be propagated promptly. The updates to advisory locks need only be
propagated when the data server needs to resolve a question about a propagated when the data server needs to resolve a question about a
stateid. In fact, if record locking is not mandatory (i.e., is stateid. In fact, if record locking is not mandatory (i.e., is
advisory), the clients are advised not to use the lock-based stateids advisory), the clients are advised not to use the lock-based stateids
for I/O at all. The stateids returned by open are sufficient and for I/O at all. The stateids returned by open are sufficient and
eliminate overhead for this kind of state propagation. eliminate overhead for this kind of state propagation.
13.11.2. Open-mode Validation If a client gets back an NFS4ERR_LOCKED error from a data server,
this is an indication that mandatory record locking is in force. The
client recovers from this by getting a record lock that covers the
affected range and reissuing the I/O with the stateid of the record
lock.
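
   That recovery sequence might look like the following C sketch;
   the two helpers are assumptions standing in for LOCK and the
   reissued READ or WRITE.

   #include <stdint.h>

   struct stateid4 { unsigned int seqid; unsigned char other[12]; };

   /* Assumed wrappers around the LOCK operation and the reissued
    * READ or WRITE (illustrative only). */
   extern int nfs_lock_range(uint64_t off, uint64_t len,
                             struct stateid4 *lock_sid_out);
   extern int nfs_reissue_io(const struct stateid4 *sid,
                             uint64_t off, uint64_t len);

   /* Recover from NFS4ERR_LOCKED: take a record lock covering the
    * affected range, then redo the I/O under the lock stateid. */
   static int recover_from_locked(uint64_t off, uint64_t len)
   {
       struct stateid4 lock_sid;
       int rc = nfs_lock_range(off, len, &lock_sid);
       if (rc != 0)
           return rc;
       return nfs_reissue_io(&lock_sid, off, len);
   }
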
Open-mode validation MUST be performed against the open mode(s) held 13.10.2.2. Open and Deny Mode Validation
by the data servers. However, the server implementation may not
always require the immediate propagation of changes. Reduction in
access because of CLOSEs or DOWNGRADEs does not have to be propagated
immediately, but SHOULD be propagated promptly; whereas changes due
to revocation MUST be propagated immediately. On the other hand,
changes that expand access (e.g., new OPEN's and upgrades) do not
have to be propagated immediately but the data server SHOULD NOT
reject a request because of open mode issues without making sure that
the upgrade is not in flight.
13.11.3. File Attributes Open and deny mode validation MUST be performed against the open and
deny mode(s) held by the data servers. When access is reduced or a
deny mode is made more restrictive (because of CLOSE or OPEN_DOWNGRADE), the
data server MUST prevent any I/Os that would be denied if performed
on the metadata server. Conversely, when access is expanded, the
data server MUST NOT reject a request because of open or deny issues
without making sure that the open mode upgrade or deny mode
relaxation is not in progress. [[Comment.13: It seems like the
recent (June/July 2007) stateid.seqid discussion might modify this; a
client might send a new stateid.seqid that tells the data server its
seqid is out of date; the data server might return NFS4ERR_DELAY.]]
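
   A data server might implement the rule this way; the C sketch
   below is illustrative, and the in-flight-upgrade check is an
   assumption about the control protocol.

   /* Error values follow the NFSv4 error number assignments. */
   enum {
       NFS4_OK          = 0,
       NFS4ERR_DELAY    = 10008,
       NFS4ERR_OPENMODE = 10038
   };

   /* Assumed checks against open/deny state pushed from the MDS. */
   extern int io_allowed_by_open_state(int is_write);
   extern int upgrade_in_progress(void);

   /* Reject only when no expansion of access can be in flight;
    * otherwise ask the client to retry. */
   static int ds_check_open_mode(int is_write)
   {
       if (io_allowed_by_open_state(is_write))
           return NFS4_OK;
       if (upgrade_in_progress())
           return NFS4ERR_DELAY;
       return NFS4ERR_OPENMODE;
   }
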
13.10.2.3. File Attributes
Since the SETATTR operation has the ability to modify state that is Since the SETATTR operation has the ability to modify state that is
visible on both the metadata and data servers (e.g., the size), care visible on both the metadata and data servers (e.g., the size), care
must be taken to ensure that the resultant state across the set of must be taken to ensure that the resultant state across the set of
data servers is consistent, especially when truncating or growing the data servers is consistent, especially when truncating or growing the
file. file.
As described earlier, the LAYOUTCOMMIT operation is used to ensure As described earlier, the LAYOUTCOMMIT operation is used to ensure
that the metadata is synchronized with changes made to the data that the metadata is synchronized with changes made to the data
servers. For the NFSv4.1-based data storage protocol, it is servers. For the NFSv4.1-based data storage protocol, it is
skipping to change at page 275, line 35 skipping to change at page 276, line 33
Any changes to file attributes that control authorization or access Any changes to file attributes that control authorization or access
as reflected by ACCESS calls or READs and WRITEs on the metadata as reflected by ACCESS calls or READs and WRITEs on the metadata
server MUST be propagated to the data servers for enforcement on server MUST be propagated to the data servers for enforcement on
READ and WRITE I/O calls. If the changes made on the metadata server READ and WRITE I/O calls. If the changes made on the metadata server
result in more restrictive access permissions for any user, those result in more restrictive access permissions for any user, those
changes MUST be propagated to the data servers synchronously. changes MUST be propagated to the data servers synchronously.
The OPEN operation (Section 17.16.5) does not impose any requirement The OPEN operation (Section 17.16.5) does not impose any requirement
that I/O operations on an open file have the same credentials as the that I/O operations on an open file have the same credentials as the
OPEN itself, and so requires the server's READ and WRITE operations OPEN itself (unless EXCHGID4_FLAG_BIND_PRINC_STATEID is set when
to perform appropriate access checking. Changes to ACLs also require EXCHANGE_ID creates the client ID) and so requires the server's READ
new access checking by READ and WRITE on the server. The propagation and WRITE operations to perform appropriate access checking. Changes
of access right changes due to changes in ACLs may be asynchronous to ACLs also require new access checking by READ and WRITE on the
only if the server implementation is able to determine that the server. The propagation of access right changes due to changes in
updated ACL is not more restrictive for any user specified in the old ACLs may be asynchronous only if the server implementation is able to
ACL. Due to the relative infrequency of ACL updates, it is suggested determine that the updated ACL is not more restrictive for any user
that all changes be propagated synchronously. specified in the old ACL. Due to the relative infrequency of ACL
updates, it is suggested that all changes be propagated
synchronously.
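
   The propagation policy for ACL changes can be summarized in a
   short C sketch; the comparison and push helpers are assumptions
   for illustration.

   struct nfs_acl;   /* opaque here */

   /* Assumed helpers: a restrictiveness comparison and the two
    * control-protocol push paths. */
   extern int  acl_more_restrictive(const struct nfs_acl *oldacl,
                                    const struct nfs_acl *newacl);
   extern int  push_acl_sync(const struct nfs_acl *newacl);
   extern void push_acl_async(const struct nfs_acl *newacl);

   /* Asynchronous propagation is safe only when the new ACL is no
    * more restrictive for any user in the old ACL; given how rare
    * ACL updates are, always going synchronous is also reasonable. */
   static int propagate_acl(const struct nfs_acl *oldacl,
                            const struct nfs_acl *newacl)
   {
       if (acl_more_restrictive(oldacl, newacl))
           return push_acl_sync(newacl);
       push_acl_async(newacl);
       return 0;
   }
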
13.12. Data Server Component File Size 13.11. Data Server Component File Size
A potential problem exists when a component data file on a particular A potential problem exists when a component data file on a particular
data server is grown past EOF; the problem exists for both dense and data server is grown past EOF; the problem exists for both dense and
sparse layouts. Imagine the following scenario: a client creates a sparse layouts. Imagine the following scenario: a client creates a
new file (size == 0) and writes to octet 131072; the client then new file (size == 0) and writes to octet 131072; the client then
seeks to the beginning of the file and reads octet 100. The client seeks to the beginning of the file and reads octet 100. The client
should receive 0s back as a result of the READ. However, if the READ should receive 0s back as a result of the READ. However, if the READ
falls on a data server different than that that received client's falls on a data server other than the one that received the client's
original WRITE, the data server servicing the READ may still believe original WRITE, the data server servicing the READ may still believe
that the file's size is 0 and return no data with the EOF flag that the file's size is 0 and return no data with the EOF flag
set. The data server can only return 0s if it knows that the file's set. The data server can only return 0s if it knows that the file's
size has been extended. This would require the immediate propagation size has been extended. This would require the immediate propagation
of the file's size to all data servers, which is potentially very of the file's size to all data servers, which is potentially very
costly. Therefore, the client that has initiated the extension of costly. Therefore, the client that has initiated the extension of
the file's size MUST be prepared to deal with these EOF conditions; the file's size MUST be prepared to deal with these EOF conditions;
the EOF'ed or short READs will be treated as a hole in the file and the EOF'ed or short READs will be treated as a hole in the file and
the NFS client will substitute 0s for the data when the offset is the NFS client will substitute 0s for the data when the offset is
less than the client's view of the file size. less than the client's view of the file size.
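
   On the client side, the substitution can be sketched in C as
   below; buffer handling is simplified for illustration.

   #include <stdint.h>
   #include <string.h>

   /* After a READ that came up short or hit a premature EOF on one
    * data server, zero-fill the tail if the client's own view of
    * the file size says the range must exist (it is a hole). */
   static void fixup_short_read(unsigned char *buf, uint64_t off,
                                uint32_t requested, uint32_t got,
                                uint64_t client_file_size)
   {
       if (got < requested && off + requested <= client_file_size)
           memset(buf + got, 0, requested - got);
   }
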
skipping to change at page 276, line 26 skipping to change at page 277, line 25
The NFSv4.1 protocol only provides close-to-open file data cache The NFSv4.1 protocol only provides close-to-open file data cache
semantics, meaning that when the file is closed, all modified data is semantics, meaning that when the file is closed, all modified data is
written to the server. When a subsequent OPEN of the file is done, written to the server. When a subsequent OPEN of the file is done,
the change attribute is inspected for a difference from a cached the change attribute is inspected for a difference from a cached
value for the change attribute. For the case above, this means that value for the change attribute. For the case above, this means that
a LAYOUTCOMMIT will be done at close (along with the data WRITEs) and a LAYOUTCOMMIT will be done at close (along with the data WRITEs) and
will update the file's size and change attribute. Access from will update the file's size and change attribute. Access from
another client after that point will result in the appropriate size another client after that point will result in the appropriate size
being returned. being returned.
13.13. Recovery Considerations 13.12. Recovery from Loss of Layout
As described in Section 12.7, the layout type-specific storage As described in Section 12.7, the layout type-specific storage
protocol is responsible for handling the effects of I/Os started protocol is responsible for handling the effects of I/Os started
before lease expiration that extend through lease expiration. The before lease expiration that extend through lease expiration. The
NFSv4.1 file layout type prevents all I/Os from being executed after NFSv4.1 file layout type prevents all I/Os from being executed after
lease expiration, without relying on a precise client lease timer and lease expiration, without relying on a precise client lease timer and
without requiring data servers to maintain lease timers. without requiring data servers to maintain lease timers.
It works as follows. As described in Section 13.1, in COMPOUND It works as follows. As described in Section 13.1, in COMPOUND
procedure requests to the data server, the data filehandle provided procedure requests to the data server, the data filehandle provided
by the PUTFH operation and the stateid in the READ or WRITE operation by the PUTFH operation and the stateid in the READ or WRITE operation
are used to validate that the client has a valid layout for the I/O are used to validate that the client has a valid layout for the I/O
being performed; if it does not, the I/O is rejected. Before the being performed; if it does not, the I/O is rejected with
metadata server takes any action to invalidate a layout given out by NFS4ERR_PNFS_NO_LAYOUT. Before the metadata server takes any action
a previous instance, it must make sure that all layouts from that to invalidate layout state given out by a previous instance, it must
previous instance are invalidated at the data servers. make sure that all layout state from that previous instance is
invalidated at the data servers.
This means that a metadata server may not restripe a file until it This means that a metadata server may not restripe a file until it
has contacted all of the data servers to invalidate the layouts from has contacted all of the data servers to invalidate the layouts from
the previous instance nor may it give out mandatory locks that the previous instance nor may it give out mandatory locks that
conflict with layouts from the previous instance without either doing conflict with layouts from the previous instance without either doing
a specific invalidation (as it would have to do anyway) or doing a a specific invalidation (as it would have to do anyway) or doing a
global data server invalidation. global data server invalidation.
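
   In C-sketch form, the required ordering at the metadata server
   is simply the following; all names are assumptions about the
   control protocol.

   struct ds;   /* one data server, opaque here */

   /* Assumed control-protocol call to discard layout state. */
   extern int ds_invalidate_layouts(struct ds *d, const char *file);

   /* Every data server must have dropped the old layouts before
    * the MDS restripes or grants a conflicting mandatory lock. */
   static int invalidate_previous_instance(struct ds **list, int n,
                                           const char *file)
   {
       for (int i = 0; i < n; i++) {
           int rc = ds_invalidate_layouts(list[i], file);
           if (rc != 0)
               return rc;   /* restripe must wait */
       }
       return 0;
   }
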
13.14. Security Considerations for the File Layout Type 13.13. Security Considerations for the File Layout Type
The NFSv4.1 file layout type MUST adhere to the security The NFSv4.1 file layout type MUST adhere to the security
considerations outlined in Section 12.9. NFSv4.1 data servers must considerations outlined in Section 12.9. NFSv4.1 data servers MUST
make all of the required access checks on each READ or WRITE I/O as make all of the required access checks on each READ or WRITE I/O as
determined by the NFSv4.1 protocol. If the metadata server would determined by the NFSv4.1 protocol. If the metadata server would
deny a READ or WRITE operation on a given file due to its ACL, mode deny a READ or WRITE operation on a given file due to its ACL, mode
attribute, open mode, open deny mode, mandatory lock state, or any attribute, open mode, open deny mode, mandatory lock state, or any
other attributes and state, the data server MUST also deny the READ other attributes and state, the data server MUST also deny the READ
or WRITE operation. This impacts the control protocol and the or WRITE operation. This impacts the control protocol and the
propagation of state from the metadata server to the data servers; propagation of state from the metadata server to the data servers;
see Section 13.11 for more details. see Section 13.10.2 for more details.
The methods for authentication, integrity, and privacy for file The methods for authentication, integrity, and privacy for file
layout-based data servers are the same as that used for metadata layout-based data servers are the same as those used by metadata
servers. Metadata and data servers use ONC RPC security flavors to servers. Metadata and data servers use ONC RPC security flavors to
authenticate, and SECINFO and SECINFO_NO_NAME to negotiate the authenticate, and SECINFO and SECINFO_NO_NAME to negotiate the
security mechanism and services to be used. security mechanism and services to be used.
For a given file object, a metadata server MAY require different For a given file object, a metadata server MAY require different
security parameters (secinfo4 value) than the data server. For a security parameters (secinfo4 value) than the data server. For a
given file object with multiple data servers, the secinfo4 value given file object with multiple data servers, the secinfo4 value
SHOULD be the same across all data servers. SHOULD be the same across all data servers. If the secinfo4 values
across a metadata server and its data servers differ for a specific
file, the mapping of the principal to the server's internal user
identifier MUST be the same in order for the access control checks
based on ACL, mode, open and deny mode, and mandatory locking to be
consistent across the pNFS server.
If an NFSv4.1 implementation supports pNFS and supports NFSv4.1 file If an NFSv4.1 implementation supports pNFS and supports NFSv4.1 file
layouts, then the implementation MUST support the SECINFO_NO_NAME layouts, then the implementation MUST support the SECINFO_NO_NAME
operation, on both the metadata and data servers. operation, on both the metadata and data servers.
14. Internationalization 14. Internationalization
The primary area in which NFS version 4 needs to deal with The primary area in which NFS version 4 needs to deal with
internationalization, or I18N, is that of file names and internationalization, or I18N, is that of file names and
other strings as used within the protocol. The choice of string other strings as used within the protocol. The choice of string
skipping to change at page 283, line 13 skipping to change at page 284, line 20
Table C.9 Table C.9
14.3.6. Bidirectional output for nfs4_mixed_prep 14.3.6. Bidirectional output for nfs4_mixed_prep
The nfs4_mixed_prep profile specifies checking bidirectional strings The nfs4_mixed_prep profile specifies checking bidirectional strings
as described in stringprep's section 6. as described in stringprep's section 6.
14.4. UTF-8 Related Errors 14.4. UTF-8 Related Errors
Where the client sends an invalid UTF-8 string, the server should Where the client sends an invalid UTF-8 string, the server should
return an NFS4ERR_INVAL (Table 8) error. This includes cases in return an NFS4ERR_INVAL (Table 10) error. This includes cases in
which inappropriate prefixes are detected and where the count which inappropriate prefixes are detected and where the count
includes trailing bytes that do not constitute a full UCS character. includes trailing bytes that do not constitute a full UCS character.
Where the client supplied string is valid UTF-8 but contains Where the client supplied string is valid UTF-8 but contains
characters that are not supported by the server as a value for that characters that are not supported by the server as a value for that
string (e.g., names containing characters that have more than two string (e.g., names containing characters that have more than two
octets on a file system that supports Unicode characters only), the octets on a file system that supports Unicode characters only), the
server should return an NFS4ERR_BADCHAR (Table 8) error. server should return an NFS4ERR_BADCHAR (Table 10) error.
Where a UTF-8 string is used as a file name, and the file system, Where a UTF-8 string is used as a file name, and the file system,
while supporting all of the characters within the name, does not while supporting all of the characters within the name, does not
allow that particular name to be used, the server should return the allow that particular name to be used, the server should return the
error NFS4ERR_BADNAME (Table 8). This includes situations in which error NFS4ERR_BADNAME (Table 10). This includes situations in which
the server file system imposes a normalization constraint on name the server file system imposes a normalization constraint on name
strings, but will also include such situations as file system strings, but will also include such situations as file system
prohibitions of "." and ".." as file names for certain operations, prohibitions of "." and ".." as file names for certain operations,
and other such constraints. and other such constraints.
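
   The three-way error mapping above can be sketched in C; the
   validation helpers are assumptions, and only the choice of error
   codes follows the text.

   /* Error values follow the NFSv4 error number assignments. */
   enum {
       NFS4_OK         = 0,
       NFS4ERR_INVAL   = 22,
       NFS4ERR_BADCHAR = 10040,
       NFS4ERR_BADNAME = 10041
   };

   /* Assumed validation helpers. */
   extern int utf8_well_formed(const char *s, unsigned int len);
   extern int chars_supported(const char *s, unsigned int len);
   extern int name_permitted(const char *s, unsigned int len);

   static int check_name(const char *s, unsigned int len)
   {
       if (!utf8_well_formed(s, len))
           return NFS4ERR_INVAL;   /* bad prefix, truncated char */
       if (!chars_supported(s, len))
           return NFS4ERR_BADCHAR; /* valid UTF-8, unsupported   */
       if (!name_permitted(s, len))
           return NFS4ERR_BADNAME; /* e.g. "." or "..", etc.     */
       return NFS4_OK;
   }
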
15. Error Values 15. Error Values
NFS error numbers are assigned to failed operations within a compound NFS error numbers are assigned to failed operations within a compound
request. A compound request contains a number of NFS operations that request. A compound request contains a number of NFS operations that
have their results encoded in sequence in a compound reply. The have their results encoded in sequence in a compound reply. The
skipping to change at page 287, line 35 skipping to change at page 288, line 35
| | | association, but it | | | | association, but it |
| | | disabled enforcement | | | | disabled enforcement |
| | | when the session was | | | | when the session was |
| | | created. | | | | created. |
| NFS4ERR_DEADLOCK | 10045 | The server has been | | NFS4ERR_DEADLOCK | 10045 | The server has been |
| | | able to determine a | | | | able to determine a |
| | | file locking | | | | file locking |
| | | deadlock condition | | | | deadlock condition |
| | | for a blocking lock | | | | for a blocking lock |
| | | request. | | | | request. |
| NFS4ERR_BADSESSION | 10782 | The specified | | NFS4ERR_DEADSESSION | 10782 | The specified |
| | | session is dead and | | | | session is dead and |
| | | does not accept new | | | | does not accept new |
| | | requests. | | | | requests. |
| NFS4ERR_DELAY | 10008 | The server initiated | | NFS4ERR_DELAY | 10008 | The server initiated |
| | | the request, but was | | | | the request, but was |
| | | not able to complete | | | | not able to complete |
| | | it in a timely | | | | it in a timely |
| | | fashion. The client | | | | fashion. The client |
| | | should wait and then | | | | should wait and then |
| | | try the request with | | | | try the request with |
skipping to change at page 293, line 31 skipping to change at page 294, line 31
| | | because previous | | | | because previous |
| | | operations have | | | | operations have |
| | | created a situation | | | | created a situation |
| | | in which the server | | | | in which the server |
| | | is not able to | | | | is not able to |
| | | determine that a | | | | determine that a |
| | | reclaim-interfering | | | | reclaim-interfering |
| | | edge condition does | | | | edge condition does |
| | | not exist. | | | | not exist. |
| NFS4ERR_NOMATCHING_LAYOUT | 10060 | Client has no | | NFS4ERR_NOMATCHING_LAYOUT | 10060 | Client has no |
| | | matching layout | | | | matching layout to |
| | | (segment) to return. | | | | return. |
| NFS4ERR_NOSPC | 28 | No space left on | | NFS4ERR_NOSPC | 28 | No space left on |
| | | device. The | | | | device. The |
| | | operation would have | | | | operation would have |
| | | caused the server's | | | | caused the server's |
| | | file system to | | | | file system to |
| | | exceed its limit. | | | | exceed its limit. |
| NFS4ERR_NOTDIR | 20 | Not a directory. | | NFS4ERR_NOTDIR | 20 | Not a directory. |
| | | The caller specified | | | | The caller specified |
| | | a non-directory in a | | | | a non-directory in a |
| | | directory operation. | | | | directory operation. |
skipping to change at page 295, line 21 skipping to change at page 296, line 21
| | | of the operation. | | | | of the operation. |
| NFS4ERR_PNFS_IO_HOLE | 10075 | The pNFS client has | | NFS4ERR_PNFS_IO_HOLE | 10075 | The pNFS client has |
| | | attempted to read | | | | attempted to read |
| | | from or write to an | | | | from or write to an |
| | | illegal hole of a | | | | illegal hole of a |
| | | file of a data | | | | file of a data |
| | | server that is using | | | | server that is using |
| | | the STRIPE4_SPARSE | | | | the STRIPE4_SPARSE |
| | | stripe type. See | | | | stripe type. See |
| | | Section 13.5. | | | | Section 13.5. |
| NFS4ERR_PNFS_NO_LAYOUT | 10080 | The pNFS client has |
| | | attempted to read |
| | | from or write to a |
| | | file without a valid |
| | | layout. |
| NFS4ERR_RECALLCONFLICT | 10061 | Layout is | | NFS4ERR_RECALLCONFLICT | 10061 | Layout is |
| | | unavailable due to a | | | | unavailable due to a |
| | | conflicting | | | | conflicting |
| | | LAYOUTRECALL that is | | | | LAYOUTRECALL that is |
| | | in progress. | | | | in progress. |
| NFS4ERR_RECLAIM_BAD | 10034 | The reclaim provided | | NFS4ERR_RECLAIM_BAD | 10034 | The reclaim provided |
| | | by the client does | | | | by the client does |
| | | not match any of the | | | | not match any of the |
| | | server's state | | | | server's state |
| | | consistency checks | | | | consistency checks |
skipping to change at page 298, line 35 skipping to change at page 299, line 35
| | | policy. The client | | | | policy. The client |
| | | should change the | | | | should change the |
| | | security mechanism | | | | security mechanism |
| | | being used and retry | | | | being used and retry |
| | | the operation. | | | | the operation. |
| NFS4ERR_XDEV | 18 | Attempt to do an | | NFS4ERR_XDEV | 18 | Attempt to do an |
| | | operation between | | | | operation between |
| | | different fsids. | | | | different fsids. |
+-----------------------------------+--------+----------------------+ +-----------------------------------+--------+----------------------+
Table 8 Table 10
15.2. Operations and their valid errors 15.2. Operations and their valid errors
Mappings of valid error returns for each protocol operation Mappings of valid error returns for each protocol operation
+----------------------+--------------------------------------------+ +----------------------+--------------------------------------------+
| Operation | Errors | | Operation | Errors |
+----------------------+--------------------------------------------+ +----------------------+--------------------------------------------+
| ACCESS | NFS4ERR_ACCESS, NFS4ERR_BADHANDLE, | | ACCESS | NFS4ERR_ACCESS, NFS4ERR_BADHANDLE, |
| | NFS4ERR_BADXDR, NFS4ERR_DELAY, | | | NFS4ERR_BADXDR, NFS4ERR_DELAY, |
| | NFS4ERR_FHEXPIRED, NFS4ERR_INVAL, | | | NFS4ERR_FHEXPIRED, NFS4ERR_INVAL, |
skipping to change at page 307, line 14 skipping to change at page 308, line 14
| READ | NFS4ERR_ACCESS, NFS4ERR_ADMIN_REVOKED, | | READ | NFS4ERR_ACCESS, NFS4ERR_ADMIN_REVOKED, |
| | NFS4ERR_BADHANDLE, NFS4ERR_BAD_STATEID, | | | NFS4ERR_BADHANDLE, NFS4ERR_BAD_STATEID, |
| | NFS4ERR_BADXDR, NFS4ERR_DELAY, | | | NFS4ERR_BADXDR, NFS4ERR_DELAY, |
| | NFS4ERR_EXPIRED, NFS4ERR_FHEXPIRED, | | | NFS4ERR_EXPIRED, NFS4ERR_FHEXPIRED, |
| | NFS4ERR_GRACE, NFS4ERR_IO, NFS4ERR_INVAL, | | | NFS4ERR_GRACE, NFS4ERR_IO, NFS4ERR_INVAL, |
| | NFS4ERR_ISDIR, NFS4ERR_LEASE_MOVED, | | | NFS4ERR_ISDIR, NFS4ERR_LEASE_MOVED, |
| | NFS4ERR_LOCKED, NFS4ERR_MOVED, | | | NFS4ERR_LOCKED, NFS4ERR_MOVED, |
| | NFS4ERR_NOFILEHANDLE, NFS4ERR_NXIO, | | | NFS4ERR_NOFILEHANDLE, NFS4ERR_NXIO, |
| | NFS4ERR_OP_NOT_IN_SESSION, | | | NFS4ERR_OP_NOT_IN_SESSION, |
| | NFS4ERR_OPENMODE, NFS4ERR_PNFS_IO_HOLE, | | | NFS4ERR_OPENMODE, NFS4ERR_PNFS_IO_HOLE, |
| | NFS4ERR_PNFS_NO_LAYOUT, |
| | NFS4ERR_REQ_TOO_BIG, NFS4ERR_TOO_MANY_OPS, | | | NFS4ERR_REQ_TOO_BIG, NFS4ERR_TOO_MANY_OPS, |
| | NFS4ERR_REP_TOO_BIG, | | | NFS4ERR_REP_TOO_BIG, |
| | NFS4ERR_REP_TOO_BIG_TO_CACHE, | | | NFS4ERR_REP_TOO_BIG_TO_CACHE, |
| | NFS4ERR_UNSAFE_COMPOUND, | | | NFS4ERR_UNSAFE_COMPOUND, |
| | NFS4ERR_SERVERFAULT, NFS4ERR_STALE, | | | NFS4ERR_SERVERFAULT, NFS4ERR_STALE, |
| | NFS4ERR_STALE_STATEID | | | NFS4ERR_STALE_STATEID |
| READDIR | NFS4ERR_ACCESS, NFS4ERR_BADHANDLE, | | READDIR | NFS4ERR_ACCESS, NFS4ERR_BADHANDLE, |
| | NFS4ERR_BAD_COOKIE, NFS4ERR_BADXDR, | | | NFS4ERR_BAD_COOKIE, NFS4ERR_BADXDR, |
| | NFS4ERR_DELAY, NFS4ERR_FHEXPIRED, | | | NFS4ERR_DELAY, NFS4ERR_FHEXPIRED, |
| | NFS4ERR_INVAL, NFS4ERR_IO, NFS4ERR_MOVED, | | | NFS4ERR_INVAL, NFS4ERR_IO, NFS4ERR_MOVED, |
skipping to change at page 311, line 15 skipping to change at page 312, line 15
| | NFS4ERR_BADHANDLE, NFS4ERR_BAD_STATEID, | | | NFS4ERR_BADHANDLE, NFS4ERR_BAD_STATEID, |
| | NFS4ERR_BADXDR, NFS4ERR_DELAY, | | | NFS4ERR_BADXDR, NFS4ERR_DELAY, |
| | NFS4ERR_DQUOT, NFS4ERR_EXPIRED, | | | NFS4ERR_DQUOT, NFS4ERR_EXPIRED, |
| | NFS4ERR_FBIG, NFS4ERR_FHEXPIRED, | | | NFS4ERR_FBIG, NFS4ERR_FHEXPIRED, |
| | NFS4ERR_GRACE, NFS4ERR_INVAL, NFS4ERR_IO, | | | NFS4ERR_GRACE, NFS4ERR_INVAL, NFS4ERR_IO, |
| | NFS4ERR_ISDIR, NFS4ERR_LEASE_MOVED, | | | NFS4ERR_ISDIR, NFS4ERR_LEASE_MOVED, |
| | NFS4ERR_LOCKED, NFS4ERR_MOVED, | | | NFS4ERR_LOCKED, NFS4ERR_MOVED, |
| | NFS4ERR_NOFILEHANDLE, NFS4ERR_NOSPC, | | | NFS4ERR_NOFILEHANDLE, NFS4ERR_NOSPC, |
| | NFS4ERR_NXIO, NFS4ERR_OP_NOT_IN_SESSION, | | | NFS4ERR_NXIO, NFS4ERR_OP_NOT_IN_SESSION, |
| | NFS4ERR_OPENMODE, NFS4ERR_PNFS_IO_HOLE, | | | NFS4ERR_OPENMODE, NFS4ERR_PNFS_IO_HOLE, |
| | NFS4ERR_PNFS_NO_LAYOUT, |
| | NFS4ERR_REQ_TOO_BIG, NFS4ERR_TOO_MANY_OPS, | | | NFS4ERR_REQ_TOO_BIG, NFS4ERR_TOO_MANY_OPS, |
| | NFS4ERR_REP_TOO_BIG, | | | NFS4ERR_REP_TOO_BIG, |
| | NFS4ERR_REP_TOO_BIG_TO_CACHE, | | | NFS4ERR_REP_TOO_BIG_TO_CACHE, |
| | NFS4ERR_UNSAFE_COMPOUND, NFS4ERR_ROFS, | | | NFS4ERR_UNSAFE_COMPOUND, NFS4ERR_ROFS, |
| | NFS4ERR_SERVERFAULT, NFS4ERR_STALE, | | | NFS4ERR_SERVERFAULT, NFS4ERR_STALE, |
| | NFS4ERR_STALE_STATEID | | | NFS4ERR_STALE_STATEID |
+----------------------+--------------------------------------------+ +----------------------+--------------------------------------------+
Table 9 Table 11
15.3. Callback operations and their valid errors 15.3. Callback operations and their valid errors
Mappings of valid error returns for each protocol callback operation Mappings of valid error returns for each protocol callback operation
+-------------------------+-----------------------------------------+ +-------------------------+-----------------------------------------+
| Callback Operation | Errors | | Callback Operation | Errors |
+-------------------------+-----------------------------------------+ +-------------------------+-----------------------------------------+
| CB_GETATTR | NFS4ERR_BADHANDLE NFS4ERR_BADXDR | | CB_GETATTR | NFS4ERR_BADHANDLE NFS4ERR_BADXDR |
| | NFS4ERR_OP_NOT_IN_SESSION, | | | NFS4ERR_OP_NOT_IN_SESSION, |
skipping to change at page 313, line 4 skipping to change at page 314, line 4
| | NFS4ERR_BAD_HIGH_SLOT, | | | NFS4ERR_BAD_HIGH_SLOT, |
| | NFS4ERR_CONN_NOT_BOUND_TO_SESSION, | | | NFS4ERR_CONN_NOT_BOUND_TO_SESSION, |
| | NFS4ERR_SEQ_FALSE_RETRY, | | | NFS4ERR_SEQ_FALSE_RETRY, |
| | NFS4ERR_SEQ_MISORDERED, | | | NFS4ERR_SEQ_MISORDERED, |
| | NFS4ERR_SEQUENCE_POS, | | | NFS4ERR_SEQUENCE_POS, |
| | NFS4ERR_REQ_TOO_BIG, | | | NFS4ERR_REQ_TOO_BIG, |
| | NFS4ERR_TOO_MANY_OPS, | | | NFS4ERR_TOO_MANY_OPS, |
| | NFS4ERR_REP_TOO_BIG, | | | NFS4ERR_REP_TOO_BIG, |
| | NFS4ERR_REP_TOO_BIG_TO_CACHE | | | NFS4ERR_REP_TOO_BIG_TO_CACHE |
+-------------------------+-----------------------------------------+ +-------------------------+-----------------------------------------+
Table 10 Table 12
15.4. Errors and the operations that use them 15.4. Errors and the operations that use them
+-----------------------------------+-------------------------------+ +-----------------------------------+-------------------------------+
| Error | Operations | | Error | Operations |
+-----------------------------------+-------------------------------+ +-----------------------------------+-------------------------------+
| NFS4ERR_ACCESS | ACCESS, COMMIT, CREATE, | | NFS4ERR_ACCESS | ACCESS, COMMIT, CREATE, |
| | GETATTR, GET_DIR_DELEGATION, | | | GETATTR, GET_DIR_DELEGATION, |
| | LINK, LOCK, LOCKT, LOCKU, | | | LINK, LOCK, LOCKT, LOCKU, |
| | LOOKUP, LOOKUPP, NVERIFY, | | | LOOKUP, LOOKUPP, NVERIFY, |
skipping to change at page 317, line 32 skipping to change at page 318, line 32
| | LOOKUPP, NVERIFY, OPEN, | | | LOOKUPP, NVERIFY, OPEN, |
| | OPENATTR, OPEN_DOWNGRADE, | | | OPENATTR, OPEN_DOWNGRADE, |
| | PUTFH, PUTPUBFH, PUTROOTFH, | | | PUTFH, PUTPUBFH, PUTROOTFH, |
| | READ, READDIR, READLINK, | | | READ, READDIR, READLINK, |
| | RELEASE_LOCKOWNER, REMOVE, | | | RELEASE_LOCKOWNER, REMOVE, |
| | RENAME, RESTOREFH, SAVEFH, | | | RENAME, RESTOREFH, SAVEFH, |
| | SECINFO, SECINFO_NO_NAME, | | | SECINFO, SECINFO_NO_NAME, |
| | SETATTR, VERIFY, WRITE | | | SETATTR, VERIFY, WRITE |
| NFS4ERR_PERM | CREATE, OPEN, SETATTR | | NFS4ERR_PERM | CREATE, OPEN, SETATTR |
| NFS4ERR_PNFS_IO_HOLE | READ, WRITE | | NFS4ERR_PNFS_IO_HOLE | READ, WRITE |
| NFS4ERR_PNFS_NO_LAYOUT | READ, WRITE |
| NFS4ERR_RECALLCONFLICT | LAYOUTGET | | NFS4ERR_RECALLCONFLICT | LAYOUTGET |
| NFS4ERR_RECLAIM_BAD | LAYOUTCOMMIT, LOCK, OPEN | | NFS4ERR_RECLAIM_BAD | LAYOUTCOMMIT, LOCK, OPEN |
| NFS4ERR_RECLAIM_CONFLICT | LOCK, OPEN | | NFS4ERR_RECLAIM_CONFLICT | LOCK, OPEN |
| NFS4ERR_REP_TOO_BIG | ACCESS, CB_GETATTR, | | NFS4ERR_REP_TOO_BIG | ACCESS, CB_GETATTR, |
| | CB_RECALL, CB_RECALL_ANY, | | | CB_RECALL, CB_RECALL_ANY, |
| | CB_SEQUENCE, CLOSE, COMMIT, | | | CB_SEQUENCE, CLOSE, COMMIT, |
| | CREATE, DELEGPURGE, | | | CREATE, DELEGPURGE, |
| | DELEGRETURN, GETATTR, GETFH, | | | DELEGRETURN, GETATTR, GETFH, |
| | GET_DIR_DELEGATION, LINK, | | | GET_DIR_DELEGATION, LINK, |
| | LOCK, LOCKT, LOCKU, LOOKUP, | | | LOCK, LOCKT, LOCKU, LOOKUP, |
skipping to change at page 320, line 43 skipping to change at page 321, line 43
| | RENAME, RESTOREFH, SAVEFH, | | | RENAME, RESTOREFH, SAVEFH, |
| | SECINFO, SECINFO_NO_NAME, | | | SECINFO, SECINFO_NO_NAME, |
| | SETATTR, VERIFY, WRITE | | | SETATTR, VERIFY, WRITE |
| NFS4ERR_WRONGSEC | GET_DIR_DELEGATION, LINK, | | NFS4ERR_WRONGSEC | GET_DIR_DELEGATION, LINK, |
| | LOOKUP, LOOKUPP, OPEN, PUTFH, | | | LOOKUP, LOOKUPP, OPEN, PUTFH, |
| | PUTPUBFH, PUTROOTFH, RENAME, | | | PUTPUBFH, PUTROOTFH, RENAME, |
| | RESTOREFH | | | RESTOREFH |
| NFS4ERR_XDEV | LINK, RENAME | | NFS4ERR_XDEV | LINK, RENAME |
+-----------------------------------+-------------------------------+ +-----------------------------------+-------------------------------+
Table 11 Table 13
16. NFS version 4.1 Procedures 16. NFS version 4.1 Procedures
16.1. Procedure 0: NULL - No Operation 16.1. Procedure 0: NULL - No Operation
16.1.1. SYNOPSIS 16.1.1. SYNOPSIS
16.1.2. ARGUMENTS 16.1.2. ARGUMENTS
void; void;
skipping to change at page 324, line 16 skipping to change at page 325, line 16
PUTFH fh1 {fh1} PUTFH fh1 {fh1}
LOOKUP "compA" {fh2} LOOKUP "compA" {fh2}
GETATTR {fh2} GETATTR {fh2}
LOOKUP "compB" {fh3} LOOKUP "compB" {fh3}
GETATTR {fh3} GETATTR {fh3}
LOOKUP "compC" {fh4} LOOKUP "compC" {fh4}
GETATTR {fh4} GETATTR {fh4}
GETFH GETFH
Figure 77 Figure 76
In this example, the PUTFH operation explicitly sets the current file In this example, the PUTFH operation explicitly sets the current file
handle value while the result of each LOOKUP operation sets the handle value while the result of each LOOKUP operation sets the
current file handle value to the resultant file system object. Also, current file handle value to the resultant file system object. Also,
the client is able to insert GETATTR operations using the current the client is able to insert GETATTR operations using the current
file handle as an argument. file handle as an argument.
Along with the current file handle, there is a saved file handle. Along with the