 draft-ietf-nfsv4-minorversion1-07.txt   draft-ietf-nfsv4-minorversion1-08.txt 
NFSv4 S. Shepler NFSv4 S. Shepler
Internet-Draft M. Eisler Internet-Draft M. Eisler
Intended status: Standards Track D. Noveck Intended status: Standards Track D. Noveck
Expires: February 2, 2007 Editors Expires: April 25, 2007 Editors
August 2006 October 22, 2006
NFSv4 Minor Version 1 NFSv4 Minor Version 1
draft-ietf-nfsv4-minorversion1-07.txt draft-ietf-nfsv4-minorversion1-08.txt
Status of this Memo Status of this Memo
By submitting this Internet-Draft, each author represents that any By submitting this Internet-Draft, each author represents that any
applicable patent or other IPR claims of which he or she is aware applicable patent or other IPR claims of which he or she is aware
have been or will be disclosed, and any of which he or she becomes have been or will be disclosed, and any of which he or she becomes
aware will be disclosed, in accordance with Section 6 of BCP 79. aware will be disclosed, in accordance with Section 6 of BCP 79.
Internet-Drafts are working documents of the Internet Engineering Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that Task Force (IETF), its areas, and its working groups. Note that
skipping to change at page 1, line 35 skipping to change at page 1, line 35
and may be updated, replaced, or obsoleted by other documents at any and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress." material or to cite them other than as "work in progress."
The list of current Internet-Drafts can be accessed at The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt. http://www.ietf.org/ietf/1id-abstracts.txt.
The list of Internet-Draft Shadow Directories can be accessed at The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html. http://www.ietf.org/shadow.html.
This Internet-Draft will expire on February 2, 2007. This Internet-Draft will expire on April 25, 2007.
Copyright Notice Copyright Notice
Copyright (C) The Internet Society (2006). Copyright (C) The Internet Society (2006).
Abstract Abstract
This Internet-Draft describes NFSv4 minor version one, including This Internet-Draft describes NFSv4 minor version one, including
features retained from the base protocol and protocol extensions made features retained from the base protocol and protocol extensions made
subsequently. The current draft includes description of the major subsequently. The current draft includes description of the major
skipping to change at page 2, line 19 skipping to change at page 2, line 19
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in RFC 2119 [1]. document are to be interpreted as described in RFC 2119 [1].
Table of Contents Table of Contents
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 9 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 9
1.1. The NFSv4.1 Protocol . . . . . . . . . . . . . . . . . . 9 1.1. The NFSv4.1 Protocol . . . . . . . . . . . . . . . . . . 9
1.2. NFS Version 4 Goals . . . . . . . . . . . . . . . . . . 9 1.2. NFS Version 4 Goals . . . . . . . . . . . . . . . . . . 9
1.3. Minor Version 1 Goals . . . . . . . . . . . . . . . . . 10 1.3. Minor Version 1 Goals . . . . . . . . . . . . . . . . . 10
1.4. Inconsistencies of this Document with Section XX . . . . 10 1.4. Overview of NFS version 4.1 Features . . . . . . . . . . 10
1.5. Overview of NFS version 4.1 Features . . . . . . . . . . 10 1.4.1. RPC and Security . . . . . . . . . . . . . . . . . . 11
1.5.1. RPC and Security . . . . . . . . . . . . . . . . . . 11 1.4.2. Protocol Structure . . . . . . . . . . . . . . . . . 11
1.5.2. Protocol Structure . . . . . . . . . . . . . . . . . 11 1.4.3. File System Model . . . . . . . . . . . . . . . . . 12
1.5.3. File System Model . . . . . . . . . . . . . . . . . . 12 1.4.4. Locking Facilities . . . . . . . . . . . . . . . . . 13
1.5.4. Locking Facilities . . . . . . . . . . . . . . . . . 13 1.5. General Definitions . . . . . . . . . . . . . . . . . . 14
1.6. General Definitions . . . . . . . . . . . . . . . . . . 14 1.6. Differences from NFSv4.0 . . . . . . . . . . . . . . . . 16
1.7. Differences from NFSv4.0 . . . . . . . . . . . . . . . . 16
2. Core Infrastructure . . . . . . . . . . . . . . . . . . . . . 16 2. Core Infrastructure . . . . . . . . . . . . . . . . . . . . . 16
2.1. Introduction . . . . . . . . . . . . . . . . . . . . . . 16 2.1. Introduction . . . . . . . . . . . . . . . . . . . . . . 16
2.2. RPC and XDR . . . . . . . . . . . . . . . . . . . . . . 16 2.2. RPC and XDR . . . . . . . . . . . . . . . . . . . . . . 16
2.2.1. RPC-based Security . . . . . . . . . . . . . . . . . 16 2.2.1. RPC-based Security . . . . . . . . . . . . . . . . . 16
2.3. COMPOUND and CB_COMPOUND . . . . . . . . . . . . . . . . 20 2.3. COMPOUND and CB_COMPOUND . . . . . . . . . . . . . . . . 20
2.4. Client Identifiers . . . . . . . . . . . . . . . . . . . 20 2.4. Client Identifiers . . . . . . . . . . . . . . . . . . . 20
2.4.1. Server Release of Clientid . . . . . . . . . . . . . 24 2.4.1. Server Release of Clientid . . . . . . . . . . . . . 24
2.5. Security Service Negotiation . . . . . . . . . . . . . . 25 2.5. Security Service Negotiation . . . . . . . . . . . . . . 25
2.5.1. NFSv4 Security Tuples . . . . . . . . . . . . . . . . 25 2.5.1. NFSv4 Security Tuples . . . . . . . . . . . . . . . 25
2.5.2. SECINFO and SECINFO_NO_NAME . . . . . . . . . . . . . 25 2.5.2. SECINFO and SECINFO_NO_NAME . . . . . . . . . . . . 25
2.5.3. Security Error . . . . . . . . . . . . . . . . . . . 26 2.5.3. Security Error . . . . . . . . . . . . . . . . . . . 26
2.6. Minor Versioning . . . . . . . . . . . . . . . . . . . . 28 2.6. Minor Versioning . . . . . . . . . . . . . . . . . . . . 29
2.7. Non-RPC-based Security Services . . . . . . . . . . . . 31 2.7. Non-RPC-based Security Services . . . . . . . . . . . . 31
2.7.1. Authorization . . . . . . . . . . . . . . . . . . . . 31 2.7.1. Authorization . . . . . . . . . . . . . . . . . . . 31
2.7.2. Auditing . . . . . . . . . . . . . . . . . . . . . . 31 2.7.2. Auditing . . . . . . . . . . . . . . . . . . . . . . 32
2.7.3. Intrusion Detection . . . . . . . . . . . . . . . . . 31 2.7.3. Intrusion Detection . . . . . . . . . . . . . . . . 32
2.8. Transport Layers . . . . . . . . . . . . . . . . . . . . 32 2.8. Transport Layers . . . . . . . . . . . . . . . . . . . . 32
2.8.1. Required and Recommended Properties of Transports . . 32 2.8.1. Required and Recommended Properties of Transports . 32
2.8.2. Client and Server Transport Behavior . . . . . . . . 32 2.8.2. Client and Server Transport Behavior . . . . . . . . 33
2.8.3. Ports . . . . . . . . . . . . . . . . . . . . . . . . 34 2.8.3. Ports . . . . . . . . . . . . . . . . . . . . . . . 34
2.9. Session . . . . . . . . . . . . . . . . . . . . . . . . 34 2.9. Session . . . . . . . . . . . . . . . . . . . . . . . . 34
2.9.1. Motivation and Overview . . . . . . . . . . . . . . . 34 2.9.1. Motivation and Overview . . . . . . . . . . . . . . 34
2.9.2. NFSv4 Integration . . . . . . . . . . . . . . . . . . 35 2.9.2. NFSv4 Integration . . . . . . . . . . . . . . . . . 35
2.9.3. Channels . . . . . . . . . . . . . . . . . . . . . . 36 2.9.3. Channels . . . . . . . . . . . . . . . . . . . . . . 36
2.9.4. Exactly Once Semantics . . . . . . . . . . . . . . . 38 2.9.4. Exactly Once Semantics . . . . . . . . . . . . . . . 39
2.9.5. RDMA Considerations . . . . . . . . . . . . . . . . . 46 2.9.5. RDMA Considerations . . . . . . . . . . . . . . . . 47
2.9.6. Sessions Security . . . . . . . . . . . . . . . . . . 48 2.9.6. Sessions Security . . . . . . . . . . . . . . . . . 50
2.9.7. Session Mechanics - Steady State . . . . . . . . . . 53 2.9.7. Session Mechanics - Steady State . . . . . . . . . . 54
2.9.8. Session Mechanics - Recovery . . . . . . . . . . . . 54 2.9.8. Session Mechanics - Recovery . . . . . . . . . . . . 55
3. Protocol Data Types . . . . . . . . . . . . . . . . . . . . . 57 3. Protocol Data Types . . . . . . . . . . . . . . . . . . . . . 58
3.1. Basic Data Types . . . . . . . . . . . . . . . . . . . . 57 3.1. Basic Data Types . . . . . . . . . . . . . . . . . . . . 59
3.2. Structured Data Types . . . . . . . . . . . . . . . . . 59 3.2. Structured Data Types . . . . . . . . . . . . . . . . . 60
4. Filehandles . . . . . . . . . . . . . . . . . . . . . . . . . 68 4. Filehandles . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.1. Obtaining the First Filehandle . . . . . . . . . . . . . 69 4.1. Obtaining the First Filehandle . . . . . . . . . . . . . 70
4.1.1. Root Filehandle . . . . . . . . . . . . . . . . . . . 69 4.1.1. Root Filehandle . . . . . . . . . . . . . . . . . . 70
4.1.2. Public Filehandle . . . . . . . . . . . . . . . . . . 69 4.1.2. Public Filehandle . . . . . . . . . . . . . . . . . 70
4.2. Filehandle Types . . . . . . . . . . . . . . . . . . . . 70 4.2. Filehandle Types . . . . . . . . . . . . . . . . . . . . 71
4.2.1. General Properties of a Filehandle . . . . . . . . . 70 4.2.1. General Properties of a Filehandle . . . . . . . . . 71
4.2.2. Persistent Filehandle . . . . . . . . . . . . . . . . 71 4.2.2. Persistent Filehandle . . . . . . . . . . . . . . . 72
4.2.3. Volatile Filehandle . . . . . . . . . . . . . . . . . 71 4.2.3. Volatile Filehandle . . . . . . . . . . . . . . . . 72
4.3. One Method of Constructing a Volatile Filehandle . . . . 72 4.3. One Method of Constructing a Volatile Filehandle . . . . 73
4.4. Client Recovery from Filehandle Expiration . . . . . . . 73 4.4. Client Recovery from Filehandle Expiration . . . . . . . 74
5. File Attributes . . . . . . . . . . . . . . . . . . . . . . . 74 5. File Attributes . . . . . . . . . . . . . . . . . . . . . . . 75
5.1. Mandatory Attributes . . . . . . . . . . . . . . . . . . 75 5.1. Mandatory Attributes . . . . . . . . . . . . . . . . . . 76
5.2. Recommended Attributes . . . . . . . . . . . . . . . . . 75 5.2. Recommended Attributes . . . . . . . . . . . . . . . . . 76
5.3. Named Attributes . . . . . . . . . . . . . . . . . . . . 76 5.3. Named Attributes . . . . . . . . . . . . . . . . . . . . 77
5.4. Classification of Attributes . . . . . . . . . . . . . . 76 5.4. Classification of Attributes . . . . . . . . . . . . . . 77
5.5. Mandatory Attributes - Definitions . . . . . . . . . . . 77 5.5. Mandatory Attributes - Definitions . . . . . . . . . . . 78
5.6. Recommended Attributes - Definitions . . . . . . . . . . 79 5.6. Recommended Attributes - Definitions . . . . . . . . . . 80
5.7. Time Access . . . . . . . . . . . . . . . . . . . . . . 87 5.7. Time Access . . . . . . . . . . . . . . . . . . . . . . 89
5.8. Interpreting owner and owner_group . . . . . . . . . . . 87 5.8. Interpreting owner and owner_group . . . . . . . . . . . 90
5.9. Character Case Attributes . . . . . . . . . . . . . . . 89 5.9. Character Case Attributes . . . . . . . . . . . . . . . 92
5.10. Quota Attributes . . . . . . . . . . . . . . . . . . . . 89 5.10. Quota Attributes . . . . . . . . . . . . . . . . . . . . 92
5.11. mounted_on_fileid . . . . . . . . . . . . . . . . . . . 90 5.11. mounted_on_fileid . . . . . . . . . . . . . . . . . . . 93
5.12. send_impl_id and recv_impl_id . . . . . . . . . . . . . 91 5.12. send_impl_id and recv_impl_id . . . . . . . . . . . . . 94
5.13. fs_layout_type . . . . . . . . . . . . . . . . . . . . . 92 5.13. fs_layout_type . . . . . . . . . . . . . . . . . . . . . 94
5.14. layout_type . . . . . . . . . . . . . . . . . . . . . . 92 5.14. layout_type . . . . . . . . . . . . . . . . . . . . . . 94
5.15. layout_hint . . . . . . . . . . . . . . . . . . . . . . 92 5.15. layout_hint . . . . . . . . . . . . . . . . . . . . . . 95
5.16. mdsthreshold . . . . . . . . . . . . . . . . . . . . . . 92 5.16. mdsthreshold . . . . . . . . . . . . . . . . . . . . . . 95
6. Access Control Lists . . . . . . . . . . . . . . . . . . . . 93 5.17. Retention Attributes . . . . . . . . . . . . . . . . . . 95
6.1. Goals . . . . . . . . . . . . . . . . . . . . . . . . . 93 6. Access Control Lists . . . . . . . . . . . . . . . . . . . . 97
6.2. File Attributes Discussion . . . . . . . . . . . . . . . 94 6.1. Goals . . . . . . . . . . . . . . . . . . . . . . . . . 97
6.2.1. ACL Attribute . . . . . . . . . . . . . . . . . . . . 94 6.2. File Attributes Discussion . . . . . . . . . . . . . . . 99
6.2.2. mode Attribute . . . . . . . . . . . . . . . . . . . 105 6.2.1. ACL Attribute . . . . . . . . . . . . . . . . . . . 99
6.3. Common Methods . . . . . . . . . . . . . . . . . . . . . 106 6.2.2. mode Attribute . . . . . . . . . . . . . . . . . . . 110
6.3.1. Interpreting an ACL . . . . . . . . . . . . . . . . . 106 6.3. Common Methods . . . . . . . . . . . . . . . . . . . . . 111
6.3.2. Computing a Mode Attribute from an ACL . . . . . . . 107 6.3.1. Interpreting an ACL . . . . . . . . . . . . . . . . 111
6.4. Requirements . . . . . . . . . . . . . . . . . . . . . . 109 6.3.2. Computing a Mode Attribute from an ACL . . . . . . . 112
6.4.1. Setting the mode and/or ACL Attributes . . . . . . . 109 6.4. Requirements . . . . . . . . . . . . . . . . . . . . . . 113
6.4.2. Retrieving the mode and/or ACL Attributes . . . . . . 110 6.4.1. Setting the mode and/or ACL Attributes . . . . . . . 114
6.4.3. Creating New Objects . . . . . . . . . . . . . . . . 111 6.4.2. Retrieving the mode and/or ACL Attributes . . . . . 115
7. Single-server Name Space . . . . . . . . . . . . . . . . . . 112 6.4.3. Creating New Objects . . . . . . . . . . . . . . . . 115
7.1. Server Exports . . . . . . . . . . . . . . . . . . . . . 112 7. Single-server Name Space . . . . . . . . . . . . . . . . . . 117
7.2. Browsing Exports . . . . . . . . . . . . . . . . . . . . 113 7.1. Server Exports . . . . . . . . . . . . . . . . . . . . . 117
7.3. Server Pseudo File System . . . . . . . . . . . . . . . 113 7.2. Browsing Exports . . . . . . . . . . . . . . . . . . . . 118
7.4. Multiple Roots . . . . . . . . . . . . . . . . . . . . . 114 7.3. Server Pseudo File System . . . . . . . . . . . . . . . 118
7.5. Filehandle Volatility . . . . . . . . . . . . . . . . . 114 7.4. Multiple Roots . . . . . . . . . . . . . . . . . . . . . 119
7.6. Exported Root . . . . . . . . . . . . . . . . . . . . . 114 7.5. Filehandle Volatility . . . . . . . . . . . . . . . . . 119
7.7. Mount Point Crossing . . . . . . . . . . . . . . . . . . 115 7.6. Exported Root . . . . . . . . . . . . . . . . . . . . . 119
7.8. Security Policy and Name Space Presentation . . . . . . 115 7.7. Mount Point Crossing . . . . . . . . . . . . . . . . . . 119
8. File Locking and Share Reservations . . . . . . . . . . . . . 116 7.8. Security Policy and Name Space Presentation . . . . . . 120
8.1. Locking . . . . . . . . . . . . . . . . . . . . . . . . 116 8. File Locking and Share Reservations . . . . . . . . . . . . . 121
8.1.1. Client and Session ID . . . . . . . . . . . . . . . . 117 8.1. Locking . . . . . . . . . . . . . . . . . . . . . . . . 121
8.1.2. State-owner and Stateid Definition . . . . . . . . . 117 8.1.1. Client and Session ID . . . . . . . . . . . . . . . 122
8.1.3. Use of the Stateid and Locking . . . . . . . . . . . 120 8.1.2. State-owner and Stateid Definition . . . . . . . . . 122
8.2. Lock Ranges . . . . . . . . . . . . . . . . . . . . . . 122 8.1.3. Use of the Stateid and Locking . . . . . . . . . . . 124
8.3. Upgrading and Downgrading Locks . . . . . . . . . . . . 122 8.2. Lock Ranges . . . . . . . . . . . . . . . . . . . . . . 127
8.4. Blocking Locks . . . . . . . . . . . . . . . . . . . . . 123 8.3. Upgrading and Downgrading Locks . . . . . . . . . . . . 127
8.5. Lease Renewal . . . . . . . . . . . . . . . . . . . . . 124 8.4. Blocking Locks . . . . . . . . . . . . . . . . . . . . . 128
8.6. Crash Recovery . . . . . . . . . . . . . . . . . . . . . 124 8.5. Lease Renewal . . . . . . . . . . . . . . . . . . . . . 128
8.6.1. Client Failure and Recovery . . . . . . . . . . . . . 124 8.6. Crash Recovery . . . . . . . . . . . . . . . . . . . . . 129
8.6.2. Server Failure and Recovery . . . . . . . . . . . . . 125 8.6.1. Client Failure and Recovery . . . . . . . . . . . . 129
8.6.3. Network Partitions and Recovery . . . . . . . . . . . 127 8.6.2. Server Failure and Recovery . . . . . . . . . . . . 130
8.7. Server Revocation of Locks . . . . . . . . . . . . . . . 131 8.6.3. Network Partitions and Recovery . . . . . . . . . . 132
8.8. Share Reservations . . . . . . . . . . . . . . . . . . . 132 8.7. Server Revocation of Locks . . . . . . . . . . . . . . . 136
8.9. OPEN/CLOSE Operations . . . . . . . . . . . . . . . . . 133 8.8. Share Reservations . . . . . . . . . . . . . . . . . . . 137
8.10. Open Upgrade and Downgrade . . . . . . . . . . . . . . . 134 8.9. OPEN/CLOSE Operations . . . . . . . . . . . . . . . . . 138
8.11. Short and Long Leases . . . . . . . . . . . . . . . . . 134 8.10. Open Upgrade and Downgrade . . . . . . . . . . . . . . . 139
8.11. Short and Long Leases . . . . . . . . . . . . . . . . . 139
8.12. Clocks, Propagation Delay, and Calculating Lease 8.12. Clocks, Propagation Delay, and Calculating Lease
Expiration . . . . . . . . . . . . . . . . . . . . . . . 135 Expiration . . . . . . . . . . . . . . . . . . . . . . . 140
8.13. Vestigial Locking Infrastructure From V4.0 . . . . . . . 135 8.13. Vestigial Locking Infrastructure From V4.0 . . . . . . . 140
9. Client-Side Caching . . . . . . . . . . . . . . . . . . . . . 136 9. Client-Side Caching . . . . . . . . . . . . . . . . . . . . . 141
9.1. Performance Challenges for Client-Side Caching . . . . . 137 9.1. Performance Challenges for Client-Side Caching . . . . . 142
9.2. Delegation and Callbacks . . . . . . . . . . . . . . . . 138 9.2. Delegation and Callbacks . . . . . . . . . . . . . . . . 143
9.2.1. Delegation Recovery . . . . . . . . . . . . . . . . . 139 9.2.1. Delegation Recovery . . . . . . . . . . . . . . . . 144
9.3. Data Caching . . . . . . . . . . . . . . . . . . . . . . 141 9.3. Data Caching . . . . . . . . . . . . . . . . . . . . . . 146
9.3.1. Data Caching and OPENs . . . . . . . . . . . . . . . 141 9.3.1. Data Caching and OPENs . . . . . . . . . . . . . . . 146
9.3.2. Data Caching and File Locking . . . . . . . . . . . . 142 9.3.2. Data Caching and File Locking . . . . . . . . . . . 147
9.3.3. Data Caching and Mandatory File Locking . . . . . . . 144 9.3.3. Data Caching and Mandatory File Locking . . . . . . 149
9.3.4. Data Caching and File Identity . . . . . . . . . . . 144 9.3.4. Data Caching and File Identity . . . . . . . . . . . 149
9.4. Open Delegation . . . . . . . . . . . . . . . . . . . . 145 9.4. Open Delegation . . . . . . . . . . . . . . . . . . . . 150
9.4.1. Open Delegation and Data Caching . . . . . . . . . . 148 9.4.1. Open Delegation and Data Caching . . . . . . . . . . 153
9.4.2. Open Delegation and File Locks . . . . . . . . . . . 149 9.4.2. Open Delegation and File Locks . . . . . . . . . . . 154
9.4.3. Handling of CB_GETATTR . . . . . . . . . . . . . . . 149 9.4.3. Handling of CB_GETATTR . . . . . . . . . . . . . . . 154
9.4.4. Recall of Open Delegation . . . . . . . . . . . . . . 152 9.4.4. Recall of Open Delegation . . . . . . . . . . . . . 157
9.4.5. Clients that Fail to Honor Delegation Recalls . . . . 154 9.4.5. Clients that Fail to Honor Delegation Recalls . . . 159
9.4.6. Delegation Revocation . . . . . . . . . . . . . . . . 155 9.4.6. Delegation Revocation . . . . . . . . . . . . . . . 160
9.5. Data Caching and Revocation . . . . . . . . . . . . . . 155 9.5. Data Caching and Revocation . . . . . . . . . . . . . . 160
9.5.1. Revocation Recovery for Write Open Delegation . . . . 156 9.5.1. Revocation Recovery for Write Open Delegation . . . 161
9.6. Attribute Caching . . . . . . . . . . . . . . . . . . . 157 9.6. Attribute Caching . . . . . . . . . . . . . . . . . . . 162
9.7. Data and Metadata Caching and Memory Mapped Files . . . 159 9.7. Data and Metadata Caching and Memory Mapped Files . . . 164
9.8. Name Caching . . . . . . . . . . . . . . . . . . . . . . 161 9.8. Name Caching . . . . . . . . . . . . . . . . . . . . . . 166
9.9. Directory Caching . . . . . . . . . . . . . . . . . . . 162 9.9. Directory Caching . . . . . . . . . . . . . . . . . . . 167
10. Multi-server Name Space . . . . . . . . . . . . . . . . . . . 163 10. Multi-server Name Space . . . . . . . . . . . . . . . . . . . 168
10.1. Location attributes . . . . . . . . . . . . . . . . . . 163 10.1. Location attributes . . . . . . . . . . . . . . . . . . 168
10.2. File System Presence or Absence . . . . . . . . . . . . 163 10.2. File System Presence or Absence . . . . . . . . . . . . 168
10.3. Getting Attributes for an Absent File System . . . . . . 165 10.3. Getting Attributes for an Absent File System . . . . . . 170
10.3.1. GETATTR Within an Absent File System . . . . . . . . 165 10.3.1. GETATTR Within an Absent File System . . . . . . . . 170
10.3.2. READDIR and Absent File Systems . . . . . . . . . . . 166 10.3.2. READDIR and Absent File Systems . . . . . . . . . . 171
10.4. Uses of Location Information . . . . . . . . . . . . . . 167 10.4. Uses of Location Information . . . . . . . . . . . . . . 172
10.4.1. File System Replication . . . . . . . . . . . . . . . 167 10.4.1. File System Replication . . . . . . . . . . . . . . 172
10.4.2. File System Migration . . . . . . . . . . . . . . . . 168 10.4.2. File System Migration . . . . . . . . . . . . . . . 174
10.4.3. Referrals . . . . . . . . . . . . . . . . . . . . . . 169 10.4.3. Referrals . . . . . . . . . . . . . . . . . . . . . 175
10.5. Additional Client-side Considerations . . . . . . . . . 169 10.5. Additional Client-side Considerations . . . . . . . . . 176
10.6. Effecting File System Transitions . . . . . . . . . . . 170 10.6. Effecting File System Transitions . . . . . . . . . . . 177
10.6.1. Transparent File System Transitions . . . . . . . . . 171 10.6.1. File System Transitions and Simultaneous Access . . 178
10.6.2. Filehandles and File System Transitions . . . . . . . 173 10.6.2. Simultaneous Use and Transparent Transitions . . . . 179
10.6.3. Fileid's and File System Transitions . . . . . . . . 173 10.6.3. Filehandles and File System Transitions . . . . . . 181
10.6.4. Fsid's and File System Transitions . . . . . . . . . 174 10.6.4. Fileid's and File System Transitions . . . . . . . . 181
10.6.5. The Change Attribute and File System Transitions . . 174 10.6.5. Fsid's and File System Transitions . . . . . . . . . 182
10.6.6. Lock State and File System Transitions . . . . . . . 175 10.6.6. The Change Attribute and File System Transitions . . 182
10.6.7. Write Verifiers and File System Transitions . . . . . 178 10.6.7. Lock State and File System Transitions . . . . . . . 183
10.7. Effecting File System Referrals . . . . . . . . . . . . 178 10.6.8. Write Verifiers and File System Transitions . . . . 186
10.7.1. Referral Example (LOOKUP) . . . . . . . . . . . . . . 179 10.7. Effecting File System Referrals . . . . . . . . . . . . 186
10.7.2. Referral Example (READDIR) . . . . . . . . . . . . . 183 10.7.1. Referral Example (LOOKUP) . . . . . . . . . . . . . 187
10.8. The Attribute fs_absent . . . . . . . . . . . . . . . . 185 10.7.2. Referral Example (READDIR) . . . . . . . . . . . . . 191
10.9. The Attribute fs_locations . . . . . . . . . . . . . . . 185 10.8. The Attribute fs_absent . . . . . . . . . . . . . . . . 193
10.10. The Attribute fs_locations_info . . . . . . . . . . . . 187 10.9. The Attribute fs_locations . . . . . . . . . . . . . . . 193
10.11. The Attribute fs_status . . . . . . . . . . . . . . . . 196 10.10. The Attribute fs_locations_info . . . . . . . . . . . . 195
11. Directory Delegations . . . . . . . . . . . . . . . . . . . . 199 10.10.1. The location4_server Structure . . . . . . . . . . . 198
11.1. Introduction to Directory Delegations . . . . . . . . . 200 10.10.2. The location4_info Structure . . . . . . . . . . . . 203
11.2. Directory Delegation Design (in brief) . . . . . . . . . 201 10.10.3. The location4_item Structure . . . . . . . . . . . . 204
10.11. The Attribute fs_status . . . . . . . . . . . . . . . . 205
11. Directory Delegations . . . . . . . . . . . . . . . . . . . . 209
11.1. Introduction to Directory Delegations . . . . . . . . . 209
11.2. Directory Delegation Design (in brief) . . . . . . . . . 210
11.3. Recommended Attributes in support of Directory 11.3. Recommended Attributes in support of Directory
Delegations . . . . . . . . . . . . . . . . . . . . . . 202 Delegations . . . . . . . . . . . . . . . . . . . . . . 211
11.4. Delegation Recall . . . . . . . . . . . . . . . . . . . 203 11.4. Delegation Recall . . . . . . . . . . . . . . . . . . . 212
11.5. Directory Delegation Recovery . . . . . . . . . . . . . 203 11.5. Directory Delegation Recovery . . . . . . . . . . . . . 212
12. Parallel NFS (pNFS) . . . . . . . . . . . . . . . . . . . . . 203 12. Parallel NFS (pNFS) . . . . . . . . . . . . . . . . . . . . . 212
12.1. Introduction . . . . . . . . . . . . . . . . . . . . . . 203 12.1. Introduction . . . . . . . . . . . . . . . . . . . . . . 212
12.2. General Definitions . . . . . . . . . . . . . . . . . . 206 12.2. General Definitions . . . . . . . . . . . . . . . . . . 215
12.2.1. Metadata Server . . . . . . . . . . . . . . . . . . . 206 12.2.1. Metadata Server . . . . . . . . . . . . . . . . . . 215
12.2.2. Client . . . . . . . . . . . . . . . . . . . . . . . 206 12.2.2. Client . . . . . . . . . . . . . . . . . . . . . . . 215
12.2.3. Storage Device . . . . . . . . . . . . . . . . . . . 206 12.2.3. Storage Device . . . . . . . . . . . . . . . . . . . 215
12.2.4. Storage Protocol . . . . . . . . . . . . . . . . . . 206 12.2.4. Storage Protocol . . . . . . . . . . . . . . . . . . 215
12.2.5. Control Protocol . . . . . . . . . . . . . . . . . . 207 12.2.5. Control Protocol . . . . . . . . . . . . . . . . . . 216
12.2.6. Metadata . . . . . . . . . . . . . . . . . . . . . . 207 12.2.6. Metadata . . . . . . . . . . . . . . . . . . . . . . 216
12.2.7. Layout . . . . . . . . . . . . . . . . . . . . . . . 207 12.2.7. Layout . . . . . . . . . . . . . . . . . . . . . . . 216
12.3. pNFS protocol semantics . . . . . . . . . . . . . . . . 208 12.3. pNFS protocol semantics . . . . . . . . . . . . . . . . 217
12.3.1. Definitions . . . . . . . . . . . . . . . . . . . . . 208 12.3.1. Definitions . . . . . . . . . . . . . . . . . . . . 217
12.3.2. Guarantees Provided by Layouts . . . . . . . . . . . 211 12.3.2. Guarantees Provided by Layouts . . . . . . . . . . . 220
12.3.3. Getting a Layout . . . . . . . . . . . . . . . . . . 212 12.3.3. Getting a Layout . . . . . . . . . . . . . . . . . . 221
12.3.4. Committing a Layout . . . . . . . . . . . . . . . . . 213 12.3.4. Committing a Layout . . . . . . . . . . . . . . . . 222
12.3.5. Recalling a Layout . . . . . . . . . . . . . . . . . 215 12.3.5. Recalling a Layout . . . . . . . . . . . . . . . . . 224
12.3.6. Metadata Server Write Propagation . . . . . . . . . . 221 12.3.6. Metadata Server Write Propagation . . . . . . . . . 230
12.3.7. Crash Recovery . . . . . . . . . . . . . . . . . . . 221 12.3.7. Crash Recovery . . . . . . . . . . . . . . . . . . . 230
12.3.8. Security Considerations . . . . . . . . . . . . . . . 227 12.3.8. Security Considerations . . . . . . . . . . . . . . 236
12.4. The NFSv4 File Layout Type . . . . . . . . . . . . . . . 228 12.4. The NFSv4.1 File Layout Type . . . . . . . . . . . . . . 237
12.4.1. File Striping and Data Access . . . . . . . . . . . . 228 12.4.1. Session Considerations . . . . . . . . . . . . . . . 237
12.4.2. Global Stateid Requirements . . . . . . . . . . . . . 236 12.4.2. File Striping and Data Access . . . . . . . . . . . 237
12.4.3. The Layout Iomode . . . . . . . . . . . . . . . . . . 236 12.4.3. Global Stateid Requirements . . . . . . . . . . . . 246
12.4.4. Storage Device State Propagation . . . . . . . . . . 237 12.4.4. The Layout Iomode . . . . . . . . . . . . . . . . . 246
12.4.5. Storage Device Component File Size . . . . . . . . . 239 12.4.5. Storage Device State Propagation . . . . . . . . . . 246
12.4.6. Crash Recovery Considerations . . . . . . . . . . . . 240 12.4.6. Storage Device Component File Size . . . . . . . . . 249
12.4.7. Security Considerations for the File Layout Type . . 240 12.4.7. Crash Recovery Considerations . . . . . . . . . . . 249
12.4.8. Alternate Approaches . . . . . . . . . . . . . . . . 241 12.4.8. Security Considerations for the File Layout Type . . 250
13. Internationalization . . . . . . . . . . . . . . . . . . . . 242 12.4.9. Alternate Approaches . . . . . . . . . . . . . . . . 250
13.1. Stringprep profile for the utf8str_cs type . . . . . . . 243 13. Internationalization . . . . . . . . . . . . . . . . . . . . 251
13.2. Stringprep profile for the utf8str_cis type . . . . . . 245 13.1. Stringprep profile for the utf8str_cs type . . . . . . . 253
13.3. Stringprep profile for the utf8str_mixed type . . . . . 246 13.2. Stringprep profile for the utf8str_cis type . . . . . . 254
13.4. UTF-8 Related Errors . . . . . . . . . . . . . . . . . . 247 13.3. Stringprep profile for the utf8str_mixed type . . . . . 256
14. Error Values . . . . . . . . . . . . . . . . . . . . . . . . 248 13.4. UTF-8 Related Errors . . . . . . . . . . . . . . . . . . 257
14.1. Error Definitions . . . . . . . . . . . . . . . . . . . 248 14. Error Values . . . . . . . . . . . . . . . . . . . . . . . . 257
14.2. Operations and their valid errors . . . . . . . . . . . 262 14.1. Error Definitions . . . . . . . . . . . . . . . . . . . 258
14.3. Callback operations and their valid errors . . . . . . . 275 14.2. Operations and their valid errors . . . . . . . . . . . 271
14.4. Errors and the operations that use them . . . . . . . . 276 14.3. Callback operations and their valid errors . . . . . . . 284
15. NFS version 4.1 Procedures . . . . . . . . . . . . . . . . . 283 14.4. Errors and the operations that use them . . . . . . . . 285
15.1. Procedure 0: NULL - No Operation . . . . . . . . . . . . 283 15. NFS version 4.1 Procedures . . . . . . . . . . . . . . . . . 292
15.2. Procedure 1: COMPOUND - Compound Operations . . . . . . 284 15.1. Procedure 0: NULL - No Operation . . . . . . . . . . . . 292
16. NFS version 4.1 Operations . . . . . . . . . . . . . . . . . 289 15.2. Procedure 1: COMPOUND - Compound Operations . . . . . . 293
16.1. Operation 3: ACCESS - Check Access Rights . . . . . . . 289 16. NFS version 4.1 Operations . . . . . . . . . . . . . . . . . 298
16.2. Operation 4: CLOSE - Close File . . . . . . . . . . . . 291 16.1. Operation 3: ACCESS - Check Access Rights . . . . . . . 298
16.3. Operation 5: COMMIT - Commit Cached Data . . . . . . . . 293 16.2. Operation 4: CLOSE - Close File . . . . . . . . . . . . 300
16.4. Operation 6: CREATE - Create a Non-Regular File Object . 295 16.3. Operation 5: COMMIT - Commit Cached Data . . . . . . . . 302
16.4. Operation 6: CREATE - Create a Non-Regular File Object . 304
16.5. Operation 7: DELEGPURGE - Purge Delegations Awaiting 16.5. Operation 7: DELEGPURGE - Purge Delegations Awaiting
Recovery . . . . . . . . . . . . . . . . . . . . . . . . 298 Recovery . . . . . . . . . . . . . . . . . . . . . . . . 307
16.6. Operation 8: DELEGRETURN - Return Delegation . . . . . . 299 16.6. Operation 8: DELEGRETURN - Return Delegation . . . . . . 308
16.7. Operation 9: GETATTR - Get Attributes . . . . . . . . . 299 16.7. Operation 9: GETATTR - Get Attributes . . . . . . . . . 308
16.8. Operation 10: GETFH - Get Current Filehandle . . . . . . 301 16.8. Operation 10: GETFH - Get Current Filehandle . . . . . . 310
16.9. Operation 11: LINK - Create Link to a File . . . . . . . 302 16.9. Operation 11: LINK - Create Link to a File . . . . . . . 311
16.10. Operation 12: LOCK - Create Lock . . . . . . . . . . . . 303 16.10. Operation 12: LOCK - Create Lock . . . . . . . . . . . . 312
16.11. Operation 13: LOCKT - Test For Lock . . . . . . . . . . 307 16.11. Operation 13: LOCKT - Test For Lock . . . . . . . . . . 316
16.12. Operation 14: LOCKU - Unlock File . . . . . . . . . . . 308 16.12. Operation 14: LOCKU - Unlock File . . . . . . . . . . . 317
16.13. Operation 15: LOOKUP - Lookup Filename . . . . . . . . . 309 16.13. Operation 15: LOOKUP - Lookup Filename . . . . . . . . . 318
16.14. Operation 16: LOOKUPP - Lookup Parent Directory . . . . 311 16.14. Operation 16: LOOKUPP - Lookup Parent Directory . . . . 320
16.15. Operation 17: NVERIFY - Verify Difference in 16.15. Operation 17: NVERIFY - Verify Difference in
Attributes . . . . . . . . . . . . . . . . . . . . . . . 312 Attributes . . . . . . . . . . . . . . . . . . . . . . . 321
16.16. Operation 18: OPEN - Open a Regular File . . . . . . . . 314 16.16. Operation 18: OPEN - Open a Regular File . . . . . . . . 323
16.17. Operation 19: OPENATTR - Open Named Attribute 16.17. Operation 19: OPENATTR - Open Named Attribute
Directory . . . . . . . . . . . . . . . . . . . . . . . 328 Directory . . . . . . . . . . . . . . . . . . . . . . . 337
16.18. Operation 21: OPEN_DOWNGRADE - Reduce Open File Access . 329 16.18. Operation 21: OPEN_DOWNGRADE - Reduce Open File Access . 338
16.19. Operation 22: PUTFH - Set Current Filehandle . . . . . . 330 16.19. Operation 22: PUTFH - Set Current Filehandle . . . . . . 339
16.20. Operation 23: PUTPUBFH - Set Public Filehandle . . . . . 331 16.20. Operation 23: PUTPUBFH - Set Public Filehandle . . . . . 340
16.21. Operation 24: PUTROOTFH - Set Root Filehandle . . . . . 332 16.21. Operation 24: PUTROOTFH - Set Root Filehandle . . . . . 342
16.22. Operation 25: READ - Read from File . . . . . . . . . . 333 16.22. Operation 25: READ - Read from File . . . . . . . . . . 343
16.23. Operation 26: READDIR - Read Directory . . . . . . . . . 335 16.23. Operation 26: READDIR - Read Directory . . . . . . . . . 345
16.24. Operation 27: READLINK - Read Symbolic Link . . . . . . 339 16.24. Operation 27: READLINK - Read Symbolic Link . . . . . . 349
16.25. Operation 28: REMOVE - Remove File System Object . . . . 340 16.25. Operation 28: REMOVE - Remove File System Object . . . . 350
16.26. Operation 29: RENAME - Rename Directory Entry . . . . . 342 16.26. Operation 29: RENAME - Rename Directory Entry . . . . . 352
16.27. Operation 31: RESTOREFH - Restore Saved Filehandle . . . 344 16.27. Operation 31: RESTOREFH - Restore Saved Filehandle . . . 354
16.28. Operation 32: SAVEFH - Save Current Filehandle . . . . . 345 16.28. Operation 32: SAVEFH - Save Current Filehandle . . . . . 355
16.29. Operation 33: SECINFO - Obtain Available Security . . . 346 16.29. Operation 33: SECINFO - Obtain Available Security . . . 355
16.30. Operation 34: SETATTR - Set Attributes . . . . . . . . . 349 16.30. Operation 34: SETATTR - Set Attributes . . . . . . . . . 359
16.31. Operation 37: VERIFY - Verify Same Attributes . . . . . 352 16.31. Operation 37: VERIFY - Verify Same Attributes . . . . . 361
16.32. Operation 38: WRITE - Write to File . . . . . . . . . . 353 16.32. Operation 38: WRITE - Write to File . . . . . . . . . . 362
16.33. Operation 40: BACKCHANNEL_CTL - Backchannel control . . 357 16.33. Operation 40: BACKCHANNEL_CTL - Backchannel control . . 367
16.34. Operation 41: BIND_CONN_TO_SESSION . . . . . . . . . . . 359 16.34. Operation 41: BIND_CONN_TO_SESSION . . . . . . . . . . . 369
16.35. Operation 42: CREATE_CLIENTID - Instantiate Clientid . . 363 16.35. Operation 42: EXCHANGE_ID - Instantiate Clientid . . . . 373
16.36. Operation 43: CREATE_SESSION - Create New Session and 16.36. Operation 43: CREATE_SESSION - Create New Session and
Confirm Clientid . . . . . . . . . . . . . . . . . . . . 369 Confirm Clientid . . . . . . . . . . . . . . . . . . . . 379
16.37. Operation 44: DESTROY_SESSION - Destroy existing 16.37. Operation 44: DESTROY_SESSION - Destroy existing
session . . . . . . . . . . . . . . . . . . . . . . . . 379 session . . . . . . . . . . . . . . . . . . . . . . . . 389
16.38. Operation 45: FREE_STATEID - Free stateid with no 16.38. Operation 45: FREE_STATEID - Free stateid with no
locks . . . . . . . . . . . . . . . . . . . . . . . . . 380 locks . . . . . . . . . . . . . . . . . . . . . . . . . 390
16.39. Operation 46: GET_DIR_DELEGATION - Get a directory 16.39. Operation 46: GET_DIR_DELEGATION - Get a directory
delegation . . . . . . . . . . . . . . . . . . . . . . . 381 delegation . . . . . . . . . . . . . . . . . . . . . . . 391
16.40. Operation 47: GETDEVICEINFO - Get Device Information . . 385 16.40. Operation 47: GETDEVICEINFO - Get Device Information . . 395
16.41. Operation 48: GETDEVICELIST . . . . . . . . . . . . . . 386 16.41. Operation 48: GETDEVICELIST . . . . . . . . . . . . . . 396
16.42. Operation 49: LAYOUTCOMMIT - Commit writes made using 16.42. Operation 49: LAYOUTCOMMIT - Commit writes made using
a layout . . . . . . . . . . . . . . . . . . . . . . . . 387 a layout . . . . . . . . . . . . . . . . . . . . . . . . 397
16.43. Operation 50: LAYOUTGET - Get Layout Information . . . . 391 16.43. Operation 50: LAYOUTGET - Get Layout Information . . . . 401
16.44. Operation 51: LAYOUTRETURN - Release Layout 16.44. Operation 51: LAYOUTRETURN - Release Layout
Information . . . . . . . . . . . . . . . . . . . . . . 394 Information . . . . . . . . . . . . . . . . . . . . . . 404
16.45. Operation 52: SECINFO_NO_NAME - Get Security on 16.45. Operation 52: SECINFO_NO_NAME - Get Security on
Unnamed Object . . . . . . . . . . . . . . . . . . . . . 396 Unnamed Object . . . . . . . . . . . . . . . . . . . . . 406
16.46. Operation 53: SEQUENCE - Supply per-procedure 16.46. Operation 53: SEQUENCE - Supply per-procedure
sequencing and control . . . . . . . . . . . . . . . . . 397 sequencing and control . . . . . . . . . . . . . . . . . 408
16.47. Operation 54: SET_SSV . . . . . . . . . . . . . . . . . 401 16.47. Operation 54: SET_SSV . . . . . . . . . . . . . . . . . 411
16.48. Operation 55: TEST_STATEID - Test stateids for 16.48. Operation 55: TEST_STATEID - Test stateids for
validity . . . . . . . . . . . . . . . . . . . . . . . . 402 validity . . . . . . . . . . . . . . . . . . . . . . . . 413
16.49. Operation 56: WANT_DELEGATION . . . . . . . . . . . . . 403 16.49. Operation 56: WANT_DELEGATION . . . . . . . . . . . . . 414
16.50. Operation 10044: ILLEGAL - Illegal operation . . . . . . 406 16.50. Operation 10044: ILLEGAL - Illegal operation . . . . . . 417
17. NFS version 4.1 Callback Procedures . . . . . . . . . . . . . 407 17. NFS version 4.1 Callback Procedures . . . . . . . . . . . . . 418
17.1. Procedure 0: CB_NULL - No Operation . . . . . . . . . . 407 17.1. Procedure 0: CB_NULL - No Operation . . . . . . . . . . 418
17.2. Procedure 1: CB_COMPOUND - Compound Operations . . . . . 407 17.2. Procedure 1: CB_COMPOUND - Compound Operations . . . . . 418
18. NFS version 4.1 Callback Operations . . . . . . . . . . . . . 409 18. NFS version 4.1 Callback Operations . . . . . . . . . . . . . 420
18.1. Operation 3: CB_GETATTR - Get Attributes . . . . . . . . 409 18.1. Operation 3: CB_GETATTR - Get Attributes . . . . . . . . 420
18.2. Operation 4: CB_RECALL - Recall an Open Delegation . . . 411 18.2. Operation 4: CB_RECALL - Recall an Open Delegation . . . 422
18.3. Operation 5: CB_LAYOUTRECALL . . . . . . . . . . . . . . 412 18.3. Operation 5: CB_LAYOUTRECALL . . . . . . . . . . . . . . 423
18.4. Operation 6: CB_NOTIFY - Notify directory changes . . . 414 18.4. Operation 6: CB_NOTIFY - Notify directory changes . . . 425
18.5. Operation 7: CB_PUSH_DELEG . . . . . . . . . . . . . . . 417 18.5. Operation 7: CB_PUSH_DELEG . . . . . . . . . . . . . . . 428
18.6. Operation 8: CB_RECALL_ANY - Keep any N delegations . . 418 18.6. Operation 8: CB_RECALL_ANY - Keep any N delegations . . 429
18.7. Operation 9: CB_RECALLABLE_OBJ_AVAIL . . . . . . . . . . 421 18.7. Operation 9: CB_RECALLABLE_OBJ_AVAIL . . . . . . . . . . 432
18.8. Operation 10: CB_RECALL_SLOT - change flow control 18.8. Operation 10: CB_RECALL_SLOT - change flow control
limits . . . . . . . . . . . . . . . . . . . . . . . . . 422 limits . . . . . . . . . . . . . . . . . . . . . . . . . 433
18.9. Operation 11: CB_SEQUENCE - Supply callback channel 18.9. Operation 11: CB_SEQUENCE - Supply callback channel
sequencing and control . . . . . . . . . . . . . . . . . 423 sequencing and control . . . . . . . . . . . . . . . . . 434
18.10. Operation 12: CB_WANTS_CANCELLED . . . . . . . . . . . . 425 18.10. Operation 12: CB_WANTS_CANCELLED . . . . . . . . . . . . 436
18.11. Operation 10044: CB_ILLEGAL - Illegal Callback 18.11. Operation 13: CB_NOTIFY_LOCK - Notify of possible
Operation . . . . . . . . . . . . . . . . . . . . . . . 426 lock availability . . . . . . . . . . . . . . . . . . . 437
19. Security Considerations . . . . . . . . . . . . . . . . . . . 427 18.12. Operation 10044: CB_ILLEGAL - Illegal Callback
20. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 427 Operation . . . . . . . . . . . . . . . . . . . . . . . 438
20.1. Defining new layout types . . . . . . . . . . . . . . . 427 19. Security Considerations . . . . . . . . . . . . . . . . . . . 439
21. References . . . . . . . . . . . . . . . . . . . . . . . . . 428 20. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 439
21.1. Normative References . . . . . . . . . . . . . . . . . . 428 20.1. Defining new layout types . . . . . . . . . . . . . . . 439
21.2. Informative References . . . . . . . . . . . . . . . . . 429 21. References . . . . . . . . . . . . . . . . . . . . . . . . . 440
Appendix A. Acknowledgments . . . . . . . . . . . . . . . . . . 430 21.1. Normative References . . . . . . . . . . . . . . . . . . 440
Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 431 21.2. Informative References . . . . . . . . . . . . . . . . . 441
Intellectual Property and Copyright Statements . . . . . . . . . 432 Appendix A. Acknowledgments . . . . . . . . . . . . . . . . . . 442
Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 443
Intellectual Property and Copyright Statements . . . . . . . . . 444
1. Introduction 1. Introduction
1.1. The NFSv4.1 Protocol 1.1. The NFSv4.1 Protocol
The NFSv4.1 protocol is a minor version of the NFSv4 protocol The NFSv4.1 protocol is a minor version of the NFSv4 protocol
described in [2]. It generally follows the guidelines for minor described in [2]. It generally follows the guidelines for minor
versioning model laid in Section 10 of RFC 3530. However, it versioning model laid in Section 10 of RFC 3530. However, it
diverges from guidelines 11 ("a client and server that supports minor diverges from guidelines 11 ("a client and server that supports minor
version X must support minor versions 0 through X-1"), and 12 ("no version X must support minor versions 0 through X-1"), and 12 ("no
skipping to change at page 10, line 34 skipping to change at page 10, line 34
o To add clarity and specificity to areas left unaddressed or not o To add clarity and specificity to areas left unaddressed or not
addressed in sufficient detail in the base protocol. addressed in sufficient detail in the base protocol.
o To add specific features based on experience with the existing o To add specific features based on experience with the existing
protocol and recent industry developments. protocol and recent industry developments.
o To provide protocol support to take advantage of clustered server o To provide protocol support to take advantage of clustered server
deployments including the ability to provide scalable parallel deployments including the ability to provide scalable parallel
access to files distributed among multiple servers. access to files distributed among multiple servers.
1.4. Inconsistencies of this Document with Section XX 1.4. Overview of NFS version 4.1 Features
Section XX, RPC Definition File, contains the definitions in XDR
description language of the constructs used by the protocol. Prior
to this section, several of the constructs are reproduced for
purposes of explanation. Although every effort has been made to
assure a correct and consistent description, the possibility of
inconsistencies exists. For any part of the document that is
inconsistent with Section XX, Section XX is to be considered
authoritative.
1.5. Overview of NFS version 4.1 Features
To provide a reasonable context for the reader, the major features of To provide a reasonable context for the reader, the major features of
NFS version 4.1 protocol will be reviewed in brief. This will be NFS version 4.1 protocol will be reviewed in brief. This will be
done to provide an appropriate context for both the reader who is done to provide an appropriate context for both the reader who is
familiar with the previous versions of the NFS protocol and the familiar with the previous versions of the NFS protocol and the
reader that is new to the NFS protocols. For the reader new to the reader that is new to the NFS protocols. For the reader new to the
NFS protocols, there is still a set of fundamental knowledge that is NFS protocols, there is still a set of fundamental knowledge that is
expected. The reader should be familiar with the XDR and RPC expected. The reader should be familiar with the XDR and RPC
protocols as described in [3] and [4]. A basic knowledge of file protocols as described in [3] and [4]. A basic knowledge of file
systems and distributed file systems is expected as well. systems and distributed file systems is expected as well.
This description of version 4.1 features will not distinguish those This description of version 4.1 features will not distinguish those
added in minor version one from those present in the base protocol added in minor version one from those present in the base protocol
but will treat minor version 1 as a unified whole. See Section 1.7 but will treat minor version 1 as a unified whole. See Section 1.6
for a description of the differences between the two minor versions. for a description of the differences between the two minor versions.
1.5.1. RPC and Security 1.4.1. RPC and Security
As with previous versions of NFS, the External Data Representation As with previous versions of NFS, the External Data Representation
(XDR) and Remote Procedure Call (RPC) mechanisms used for the NFS (XDR) and Remote Procedure Call (RPC) mechanisms used for the NFS
version 4.1 protocol are those defined in [3] and [4]. To meet end- version 4.1 protocol are those defined in [3] and [4]. To meet end-
to-end security requirements, the RPCSEC_GSS framework [5] will be to-end security requirements, the RPCSEC_GSS framework [5] will be
used to extend the basic RPC security. With the use of RPCSEC_GSS, used to extend the basic RPC security. With the use of RPCSEC_GSS,
various mechanisms can be provided to offer authentication, various mechanisms can be provided to offer authentication,
integrity, and privacy to the NFS version 4 protocol. Kerberos V5 integrity, and privacy to the NFS version 4 protocol. Kerberos V5
will be used as described in [6] to provide one security framework. will be used as described in [6] to provide one security framework.
The LIPKEY and SPKM-3 GSS-API mechanisms described in [7] will be The LIPKEY and SPKM-3 GSS-API mechanisms described in [7] will be
skipping to change at page 11, line 36 skipping to change at page 11, line 28
RPCSEC_GSS, other mechanisms may also be specified and used for NFS RPCSEC_GSS, other mechanisms may also be specified and used for NFS
version 4.1 security. version 4.1 security.
To enable in-band security negotiation, the NFS version 4.1 protocol To enable in-band security negotiation, the NFS version 4.1 protocol
has operations which provide the client a method of querying the has operations which provide the client a method of querying the
server about its policies regarding which security mechanisms must be server about its policies regarding which security mechanisms must be
used for access to the server's file system resources. With this, used for access to the server's file system resources. With this,
the client can securely match the security mechanism that meets the the client can securely match the security mechanism that meets the
policies specified at both the client and server. policies specified at both the client and server.
1.5.2. Protocol Structure 1.4.2. Protocol Structure
1.5.2.1. Core Protocol 1.4.2.1. Core Protocol
Unlike NFS Versions 2 and 3, which used a series of ancillary Unlike NFS Versions 2 and 3, which used a series of ancillary
protocols (e.g. NLM, NSM, MOUNT), within all minor versions of NFS protocols (e.g. NLM, NSM, MOUNT), within all minor versions of NFS
version 4 only a single RPC protocol is used to make requests of the version 4 only a single RPC protocol is used to make requests of the
server. Facilities that had been separate protocols, such as server. Facilities that had been separate protocols, such as
locking, are now integrated within a single unified protocol. locking, are now integrated within a single unified protocol.
1.5.2.2. Parallel Access 1.4.2.2. Parallel Access
Minor version one supports high-performance data access to a Minor version one supports high-performance data access to a
clustered server implementation by enabling a separation of metadata clustered server implementation by enabling a separation of metadata
access and data access, with the latter done to multiple servers in access and data access, with the latter done to multiple servers in
parallel. parallel.
Such parallel data access is controlled by recallable objects known Such parallel data access is controlled by recallable objects known
as "layouts", which are integrated into the protocol locking model. as "layouts", which are integrated into the protocol locking model.
Clients direct requests for data access to a set of data servers Clients direct requests for data access to a set of data servers
specified by the layout via a data storage protocol which may be specified by the layout via a data storage protocol which may be
NFSv4.1 or may be another protocol. NFSv4.1 or may be another protocol.
1.5.3. File System Model 1.4.3. File System Model
The general file system model used for the NFS version 4.1 protocol The general file system model used for the NFS version 4.1 protocol
is the same as previous versions. The server file system is is the same as previous versions. The server file system is
hierarchical with the regular files contained within being treated as hierarchical with the regular files contained within being treated as
opaque byte streams. In a slight departure, file and directory names opaque byte streams. In a slight departure, file and directory names
are encoded with UTF-8 to deal with the basics of are encoded with UTF-8 to deal with the basics of
internationalization. internationalization.
The NFS version 4.1 protocol does not require a separate protocol to The NFS version 4.1 protocol does not require a separate protocol to
provide for the initial mapping between path name and filehandle. provide for the initial mapping between path name and filehandle.
All file systems exported by a server are presented as a tree so that All file systems exported by a server are presented as a tree so that
all file systems are reachable from a special per-server global root all file systems are reachable from a special per-server global root
filehandle. This allows LOOKUP operations to be used to perform filehandle. This allows LOOKUP operations to be used to perform
functions previously provided by the MOUNT protocol. The server functions previously provided by the MOUNT protocol. The server
provides any necessary pseudo filesystems to bridge any gaps that provides any necessary pseudo filesystems to bridge any gaps that
arise due to unexported gaps between exported file systems. arise due to unexported gaps between exported file systems.
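The way a path walk from the root filehandle can stand in for the MOUNT protocol may be sketched as follows. This is only an illustration: the helper name compound_for_path and the tuple representation of operations are hypothetical, while PUTROOTFH, LOOKUP, and GETFH are operation names taken from the table of contents.

```python
# Illustrative sketch only: compound_for_path and the tuple encoding are
# assumptions made for this example, not part of the protocol definition.
def compound_for_path(path):
    """Return the operation sequence a client could send to resolve `path`."""
    ops = [("PUTROOTFH",)]                     # start at the per-server root filehandle
    for component in path.split("/"):
        if component:                          # skip empty parts from leading or doubled slashes
            ops.append(("LOOKUP", component))  # one LOOKUP per path component
    ops.append(("GETFH",))                     # ask for the resulting current filehandle
    return ops
```

For a path such as "/export/home", the sketch yields PUTROOTFH, two LOOKUP operations, and a final GETFH. Any pseudo file system the server provides is crossed by the same LOOKUP steps, so the client needs no separate mounting step.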
1.5.3.1. Filehandles 1.4.3.1. Filehandles
As in previous versions of the NFS protocol, opaque filehandles are As in previous versions of the NFS protocol, opaque filehandles are
used to identify individual files and directories. Lookup-type and used to identify individual files and directories. Lookup-type and
create operations are used to go from file and directory names to the create operations are used to go from file and directory names to the
filehandle which is then used to identify the object to subsequent filehandle which is then used to identify the object to subsequent
operations. operations.
The NFS version 4.1 protocol provides support for persistent The NFS version 4.1 protocol provides support for persistent
filehandles, guaranteed to be valid for the lifetime of the file filehandles, guaranteed to be valid for the lifetime of the file
system object designated. In addition it provides support to servers system object designated. In addition it provides support to servers
to provide filehandles with more limited validity guarantees, called to provide filehandles with more limited validity guarantees, called
volatile filehandles. volatile filehandles.
1.5.3.2. File Attributes 1.4.3.2. File Attributes
The NFS version 4.1 protocol has a rich and extensible attribute The NFS version 4.1 protocol has a rich and extensible attribute
structure. Only a small set of the defined attributes are mandatory structure. Only a small set of the defined attributes are mandatory
and must be provided by all server implementations. The other and must be provided by all server implementations. The other
attributes are known as "recommended" attributes. attributes are known as "recommended" attributes.
One significant recommended file attribute is the Access Control List One significant recommended file attribute is the Access Control List
(ACL) attribute. This attribute provides for directory and file (ACL) attribute. This attribute provides for directory and file
access control beyond the model used in NFS Versions 2 and 3. The access control beyond the model used in NFS Versions 2 and 3. The
ACL definition allows for specification of specific sets of permissions ACL definition allows for specification of specific sets of permissions
for individual users and groups. In addition, ACL inheritance allows for individual users and groups. In addition, ACL inheritance allows
propagation of access permissions and restriction down a directory propagation of access permissions and restriction down a directory
tree as filesystem objects are created. tree as filesystem objects are created.
One other type of attribute is the named attribute. A named One other type of attribute is the named attribute. A named
attribute is an opaque byte stream that is associated with a attribute is an opaque byte stream that is associated with a
directory or file and referred to by a string name. Named attributes directory or file and referred to by a string name. Named attributes
are meant to be used by client applications as a method to associate are meant to be used by client applications as a method to associate
application specific data with a regular file or directory. application specific data with a regular file or directory.
1.5.3.3. Multi-server Namespace 1.4.3.3. Multi-server Namespace
NFS Version 4.1 contains a number of features to allow implementation NFS Version 4.1 contains a number of features to allow implementation
of namespaces that cross server boundaries and that allow and of namespaces that cross server boundaries and that allow and
facilitate a non-disruptive transfer of support for individual file facilitate a non-disruptive transfer of support for individual file
systems between servers. They are all based upon attributes that systems between servers. They are all based upon attributes that
allow one file system to specify alternate or new locations for that allow one file system to specify alternate or new locations for that
file system. file system.
These attributes may be used together with the concept of absent file These attributes may be used together with the concept of absent file
systems, which provide specifications for additional locations but no systems, which provide specifications for additional locations but no
actual file system content. This allows a number of important actual file system content. This allows a number of important
facilities: facilities:
o Location attributes may be used with absent file systems to o Location attributes may be used with absent file systems to
implement referrals whereby one server may direct the client to a implement referrals whereby one server may direct the client to a
file system provided by another server. This allows extensive file system provided by another server. This allows extensive
mult-server namespaces to be constructed. multi-server namespaces to be constructed.
o Location attributes may be provided for present file systems to o Location attributes may be provided for present file systems to
provide the locations of alternate file system instances or replicas provide the locations of alternate file system instances or replicas
to be used in the event that the current file system instance to be used in the event that the current file system instance
becomes unavailable. becomes unavailable.
o Location attributes may be provided when a previously present file o Location attributes may be provided when a previously present file
system becomes absent. This allows non-disruptive migration of system becomes absent. This allows non-disruptive migration of
file systems to alternate servers. file systems to alternate servers.
1.5.4. Locking Facilities 1.4.4. Locking Facilities
As mentioned previously, NFS v4.1 is a single protocol which As mentioned previously, NFS v4.1 is a single protocol which
includes locking facilities. These locking facilities include includes locking facilities. These locking facilities include
support for many types of locks including a number of sorts of support for many types of locks including a number of sorts of
recallable locks. Recallable locks such as delegations allow the recallable locks. Recallable locks such as delegations allow the
client to be assured that certain events will not occur so long as client to be assured that certain events will not occur so long as
that lock is held. When circumstances change, the lock is recalled that lock is held. When circumstances change, the lock is recalled
via a callback request. The assurances provided by via a callback request. The assurances provided by
delegations allow more extensive caching to be done safely when delegations allow more extensive caching to be done safely when
circumstances allow it. circumstances allow it.
skipping to change at page 14, line 31 skipping to change at page 14, line 27
client and that no change to the data's location inconsistent with client and that no change to the data's location inconsistent with
that access may be made so long as the layout is held. that access may be made so long as the layout is held.
All locks for a given client are tied together under a single client- All locks for a given client are tied together under a single client-
wide lease. All requests made on sessions associated with the client wide lease. All requests made on sessions associated with the client
renew that lease. When leases are not promptly renewed, locks are renew that lease. When leases are not promptly renewed, locks are
subject to revocation. In the event of server reinitialization, subject to revocation. In the event of server reinitialization,
clients have the opportunity to safely reclaim their locks within a clients have the opportunity to safely reclaim their locks within a
special grace period. special grace period.
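The client-wide lease described above can be pictured with a small model. This is a minimal sketch under stated assumptions: the class name ClientLease, the use of plain numbers for time, and the strict "older than the lease time" expiry rule are all hypothetical choices for illustration, and the grace period after server reinitialization is not modeled.

```python
class ClientLease:
    """Toy model of one lease covering all locks held by a client."""

    def __init__(self, lease_time, now):
        self.lease_time = lease_time   # lease duration (assumed unit: seconds)
        self.last_renewed = now        # time of the most recent renewal

    def on_request(self, now):
        # Any request on any of the client's sessions renews the single lease.
        self.last_renewed = now

    def locks_revocable(self, now):
        # Assumed rule: locks become subject to revocation once the lease
        # has gone unrenewed for longer than lease_time.
        return now - self.last_renewed > self.lease_time
```

In this model a client that keeps issuing requests never needs a dedicated renewal call, since each request refreshes the lease for every lock it holds.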
1.6. General Definitions 1.5. General Definitions
The following definitions are provided for the purpose of providing The following definitions are provided for the purpose of providing
an appropriate context for the reader. an appropriate context for the reader.
Client The "client" is the entity that accesses the NFS server's Client The "client" is the entity that accesses the NFS server's
resources. The client may be an application which contains the resources. The client may be an application which contains the
logic to access the NFS server directly. The client may also be logic to access the NFS server directly. The client may also be
the traditional operating system client remote file system the traditional operating system client remote file system
services for a set of applications. services for a set of applications.
skipping to change at page 16, line 14 skipping to change at page 16, line 9
Stateid A 128-bit quantity returned by a server that uniquely Stateid A 128-bit quantity returned by a server that uniquely
defines the open and locking state provided by the server for a defines the open and locking state provided by the server for a
specific open or lock owner for a specific file. meaning and are specific open or lock owner for a specific file. meaning and are
reserved values. reserved values.
Verifier A 64-bit quantity generated by the client that the server Verifier A 64-bit quantity generated by the client that the server
can use to determine if the client has restarted and lost all can use to determine if the client has restarted and lost all
previous lock state. previous lock state.
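The sizes given for the stateid and the verifier can be checked with a short encoding sketch. Several things here are assumptions not stated in the definitions above: the split of the stateid into a 32-bit sequence number followed by 12 opaque bytes follows the stateid4 structure of RFC 3530, the verifier is treated as a 64-bit integer purely for convenience, and the function names are hypothetical.

```python
import struct

def pack_stateid(seqid, other):
    """Encode a stateid as 4 bytes of seqid plus 12 opaque bytes (assumed layout)."""
    if len(other) != 12:
        raise ValueError("other must be exactly 12 bytes")
    return struct.pack(">I12s", seqid, other)   # big-endian, as XDR requires

def pack_verifier(value):
    """Encode a verifier as 8 bytes; viewing it as an integer is an assumption."""
    return struct.pack(">Q", value)
```

Both encodings are fixed-size, which is what lets them be compared byte for byte without any interpretation of their contents by the other party.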
1.7. Differences from NFSv4.0 1.6. Differences from NFSv4.0
The following summarizes the differences between minor version one The following summarizes the differences between minor version one
and the base protocol: and the base protocol:
o Implementation of the sessions model. o Implementation of the sessions model.
o Support for parallel access to data. o Support for parallel access to data.
o Addition of the RECLAIM_COMPLETE operation to better structure the o Addition of the RECLAIM_COMPLETE operation to better structure the
lock reclamation process. lock reclamation process.
skipping to change at page 16, line 51 skipping to change at page 16, line 46
2.2. RPC and XDR 2.2. RPC and XDR
The NFS version 4.1 (NFSv4.1) protocol is a Remote Procedure Call The NFS version 4.1 (NFSv4.1) protocol is a Remote Procedure Call
(RPC) application that uses RPC version 2 and the corresponding (RPC) application that uses RPC version 2 and the corresponding
eXternal Data Representation (XDR) as defined in RFC1831 [4] and eXternal Data Representation (XDR) as defined in RFC1831 [4] and
RFC4506 [3]. RFC4506 [3].
2.2.1. RPC-based Security 2.2.1. RPC-based Security
Previous NFS versions have been thought of as having a host-based Previous NFS versions have been thought of as having a host-based
authentication model, where the NFS server authenticates the the NFS authentication model, where the NFS server authenticates the NFS
client, and trusts the client to authenticate all users. Actually, client, and trusts the client to authenticate all users. Actually,
NFS has always depended on RPC for authentication. The first form of NFS has always depended on RPC for authentication. The first form of
RPC authentication required a host-based authentication RPC authentication required a host-based authentication
approach. NFSv4 also depends on RPC for basic security services, and approach. NFSv4 also depends on RPC for basic security services, and
mandates RPC support for a user-based authentication model. The mandates RPC support for a user-based authentication model. The
user-based authentication model has user principals authenticated by user-based authentication model has user principals authenticated by
a server, and in turn the server authenticated by user principals. a server, and in turn the server authenticated by user principals.
RPC provides some basic security services which are used by NFSv4. RPC provides some basic security services which are used by NFSv4.
2.2.1.1. RPC Security Flavors 2.2.1.1. RPC Security Flavors
skipping to change at page 21, line 15 skipping to change at page 21, line 15
Clientids are used to support lock identification and crash Clientids are used to support lock identification and crash
recovery. recovery.
In NFSv4.1, the clientid associated with each operation is derived In NFSv4.1, the clientid associated with each operation is derived
from the session (see Section 2.9) on which the operation is issued. from the session (see Section 2.9) on which the operation is issued.
Each session is associated with a specific clientid at session Each session is associated with a specific clientid at session
creation and that clientid then becomes the clientid associated with creation and that clientid then becomes the clientid associated with
all requests issued using it. Therefore, unlike NFSv4.0, no NFSv4.1 all requests issued using it. Therefore, unlike NFSv4.0, no NFSv4.1
operation is possible until a clientid is established. operation is possible until a clientid is established.
A sequence of a CREATE_CLIENTID operation followed by a A sequence of an EXCHANGE_ID operation followed by a CREATE_SESSION
CREATE_SESSION operation using that clientid is required to establish operation using that clientid is required to establish the
the identification on the server. Establishment of identification by identification on the server. Establishment of identification by a
a new incarnation of the client also has the effect of immediately new incarnation of the client also has the effect of immediately
releasing any locking state that a previous incarnation of that same releasing any locking state that a previous incarnation of that same
client might have had on the server. Such released state would client might have had on the server. Such released state would
include all lock, share reservation, and, where the server is not include all lock, share reservation, and, where the server is not
supporting the CLAIM_DELEGATE_PREV claim type, all delegation state supporting the CLAIM_DELEGATE_PREV claim type, all delegation state
associated with the same client with the same identity. For discussion associated with the same client with the same identity. For discussion
of delegation state recovery, see Section 9.2.1. of delegation state recovery, see Section 9.2.1.
Releasing such state requires that the server be able to determine Releasing such state requires that the server be able to determine
that one client instance is the successor of another. Where this that one client instance is the successor of another. Where this
cannot be done, for any of a number of reasons, the locking state cannot be done, for any of a number of reasons, the locking state
will remain for a time subject to lease expiration (see Section 8.5) will remain for a time subject to lease expiration (see Section 8.5)
and the new client will need to wait for such state to be removed, if and the new client will need to wait for such state to be removed, if
it makes conflicting lock requests. it makes conflicting lock requests.
Client identification is encapsulated in the following structure: Client identification is encapsulated in the following structure:
struct nfs_client_id4 { struct client_owner4 {
verifier4 verifier; verifier4 co_verifier;
opaque id<NFS4_OPAQUE_LIMIT>; opaque co_ownerid<NFS4_OPAQUE_LIMIT>;
}; };
The first field, verifier, is a client incarnation verifier that is The first field, co_verifier, is a client incarnation verifier that
used to detect client reboots. Only if the verifier is different is used to detect client reboots. Only if the co_verifier is
from that the server had previously recorded for the client (as different from that the server had previously recorded for the client
identified by the second field of the structure, id) does the server (as identified by the second field of the structure, co_ownerid) does
start the process of canceling the client's leased state. the server start the process of canceling the client's leased state.
The second field, id is a variable length string that uniquely The second field, co_ownerid is a variable length string that
defines the client so that subsequent instances of the same client uniquely defines the client so that subsequent instances of the same
bear the same id with a different verifier. client bear the same co_ownerid with a different verifier.
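The reboot-detection rule above can be sketched as follows. This is a minimal illustration, not the draft's algorithm: the ClientOwner and OwnerTable names and the string results are hypothetical, and a real server would also apply the principal and lease checks described elsewhere in this section.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClientOwner:
    """Illustrative stand-in for the client_owner4 structure."""
    co_verifier: bytes  # client incarnation verifier
    co_ownerid: bytes   # string that identifies the client across reboots


class OwnerTable:
    """Hypothetical server-side record of the last verifier per co_ownerid."""

    def __init__(self):
        self._verifiers = {}

    def present(self, owner):
        """Record the owner and report how it relates to what was stored."""
        previous = self._verifiers.get(owner.co_ownerid)
        self._verifiers[owner.co_ownerid] = owner.co_verifier
        if previous is None:
            return "new"       # first time this co_ownerid is seen
        if previous != owner.co_verifier:
            return "rebooted"  # same client, new incarnation: cancel leased state
        return "same"          # same incarnation: leave state alone
```

Only the "rebooted" result would start the process of canceling the client's leased state.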
There are several considerations for how the client generates the id There are several considerations for how the client generates the
string: co_ownerid string:
o The string should be unique so that multiple clients do not o The string should be unique so that multiple clients do not
present the same string. The consequences of two clients present the same string. The consequences of two clients
presenting the same string range from one client getting an error presenting the same string range from one client getting an error
to one client having its leased state abruptly and unexpectedly to one client having its leased state abruptly and unexpectedly
canceled. canceled.
o The string should be selected so the subsequent incarnations (e.g. o The string should be selected so the subsequent incarnations (e.g.
reboots) of the same client cause the client to present the same reboots) of the same client cause the client to present the same
string. The implementor is cautioned against an approach that string. The implementor is cautioned against an approach that
requires the string to be recorded in a local file because this requires the string to be recorded in a local file because this
precludes the use of the implementation in an environment where precludes the use of the implementation in an environment where
there is no local disk and all file access is from an NFS version there is no local disk and all file access is from an NFS version
4 server. 4 server.
o The string should be different for each server network address o The string should be the same for each server network address that
that the client accesses, rather than common to all server network the client accesses, rather than common to all server network
addresses. The reason is that it may not be possible for the addresses (note: the precise opposite was advised in RFC3530).
client to tell if same server is listening on multiple network This way, if a server has multiple interfaces, the client can
addresses. If the client issues CREATE_CLIENTID with the same id trunk traffic over multiple network paths as described in
string to each network address of such a server, the server will Section 2.9.3.4.1.
think it is the same client, and each successive CREATE_CLIENTID
will cause the server remove the client's previous leased state.
Regardless, as described in Section 2.9.3.4.1, NFSv4.1 does allow
clients to trunk traffic for a single clientid to one or more of a
server's networking addresses.
o The algorithm for generating the string should not assume that the o The algorithm for generating the string should not assume that the
client's network address will not change. This includes changes client's network address will not change. This includes changes
between client incarnations and even changes while the client is between client incarnations and even changes while the client is
still running in its current incarnation. This means that if the still running in its current incarnation. This means that if the
client includes just the client's and server's network address in client includes just the client's and server's network address in
the id string, there is a real risk, after the client gives up the the co_ownerid string, there is a real risk, after the client
network address, that another client, using a similar algorithm gives up the network address, that another client, using a similar
for generating the id string, would generate a conflicting id algorithm for generating the co_ownerid string, would generate a
string. conflicting co_ownerid string.
Given the above considerations, an example of a well generated id
string is one that includes:
o The server's network address. Given the above considerations, an example of a well generated
co_ownerid string is one that includes:
o The client's network address. o The client's network address.
o For a user level NFS version 4 client, it should contain o For a user level NFS version 4 client, it should contain
additional information to distinguish the client from other user additional information to distinguish the client from other user
level clients running on the same host, such as a process id or level clients running on the same host, such as a process id or
other unique sequence. other unique sequence.
o Additional information that tends to be unique, such as one or o Additional information that tends to be unique, such as one or
more of: more of:
skipping to change at page 23, line 26 skipping to change at page 23, line 19
previously mentioned caution about using information that is previously mentioned caution about using information that is
stored in a file, because the file might only be accessible stored in a file, because the file might only be accessible
over NFS version 4). over NFS version 4).
* A true random number. However since this number ought to be * A true random number. However since this number ought to be
the same between client incarnations, this shares the same the same between client incarnations, this shares the same
problem as that of using the timestamp of the software problem as that of using the timestamp of the software
installation. installation.
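As a rough sketch of the considerations above, the following builds a co_ownerid from the client's network address, a caller-supplied unique token, and (for a user-level client) a process id. The function name, the "/" separator, and the unique_token parameter are assumptions made for illustration; the draft does not prescribe an encoding. The server's network address is deliberately left out, following the revised advice in the right-hand column.

```python
NFS4_OPAQUE_LIMIT = 1024  # size limit for opaque fields, as in RFC 3530


def make_co_ownerid(client_addr, unique_token, user_level_pid=None):
    """Build an illustrative co_ownerid string.

    client_addr    -- the client's network address, as text
    unique_token   -- bytes that tend to be unique and stable across reboots
                      (hypothetical; the choice is left to the implementor)
    user_level_pid -- only for user-level clients sharing one host
    """
    parts = [client_addr.encode("utf-8"), unique_token]
    if user_level_pid is not None:
        parts.append(str(user_level_pid).encode("ascii"))
    ownerid = b"/".join(parts)
    if len(ownerid) > NFS4_OPAQUE_LIMIT:
        raise ValueError("co_ownerid exceeds NFS4_OPAQUE_LIMIT")
    return ownerid
```

Because the client's address may change, the unique token is what keeps two clients that reuse an address from generating conflicting strings.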
As a security measure, the server MUST NOT cancel a client's leased As a security measure, the server MUST NOT cancel a client's leased
state if the principal that established the state for a given id string is state if the principal that established the state for a given co_ownerid
not the same as the principal issuing the CREATE_CLIENTID. string is not the same as the principal issuing the EXCHANGE_ID.
A server may compare an nfs_client_id4 in a CREATE_CLIENTID with an A server may compare a client_owner4 in an EXCHANGE_ID with an
nfs_client_id4 established using SETCLIENTID using NFSv4 minor nfs_client_id4 established using SETCLIENTID using NFSv4 minor
version 0, so that an NFSv4.1 client is not forced to delay until version 0, so that an NFSv4.1 client is not forced to delay until
lease expiration for locking state established by the earlier client lease expiration for locking state established by the earlier client
using minor version 0. using minor version 0. This requires the client_owner4 be
constructed the same way as the nfs_client_id4. If the latter's
contents included the server's network address, and the NFSv4.1
client does not wish to use a clientid that prevents trunking, it
should issue two EXCHANGE_ID operations. The first EXCHANGE_ID will
have a client_owner4 equal to the nfs_client_id4. This will clear
the state created by the NFSv4.0 client. The second EXCHANGE_ID will
not have the server's network address. The state created for the
second EXCHANGE_ID will not have to wait for lease expiration,
because there will be no state to expire.
Once a CREATE_CLIENTID has been done, and the resulting clientid Once an EXCHANGE_ID has been done, and the resulting clientid
established as associated with a session, all requests made on that established as associated with a session, all requests made on that
session implicitly identify that clientid, which in turn designates session implicitly identify that clientid, which in turn designates
the client specified using the long-form nfs_client_id4 structure. the client specified using the long-form client_owner4 structure.
The shorthand client identifier (a clientid) is assigned by the The shorthand client identifier (a clientid) is assigned by the
server and should be chosen so that it will not conflict with a server and should be chosen so that it will not conflict with a
clientid previously assigned by the server. This applies across clientid previously assigned by the server. This applies across
server restarts or reboots. server restarts or reboots.
In the event of a server restart, a client will find out that its In the event of a server restart, a client will find out that its
current clientid is no longer valid when it receives an current clientid is no longer valid when it receives an
NFS4ERR_STALE_CLIENTID error. The precise circumstances depend on NFS4ERR_STALE_CLIENTID error. The precise circumstances depend on
the characteristics of the sessions involved, specifically whether the characteristics of the sessions involved, specifically whether
the session is persistent (see Section 2.9.4.5). the session is persistent (see Section 2.9.4.5).
When a session is not persistent, the client will need to create a When a session is not persistent, the client will need to create a
new session. When the existing clientid is presented to a server as new session. When the existing clientid is presented to a server as
part of creating a session and that clientid is not recognized, as part of creating a session and that clientid is not recognized, as
would happen after a server reboot, the server will reject the would happen after a server reboot, the server will reject the
request with the error NFS4ERR_STALE_CLIENTID. When this happens, request with the error NFS4ERR_STALE_CLIENTID. When this happens,
the client must obtain a new clientid by use of the CREATE_CLIENTID the client must obtain a new clientid by use of the EXCHANGE_ID
operation and then use that clientid as the basis of a operation and then use that clientid as the basis of a
new session and then proceed to any other necessary recovery for the new session and then proceed to any other necessary recovery for the
server reboot case (See Section 8.6.2). server reboot case (See Section 8.6.2).
In the case of the session being persistent, the client will re- In the case of the session being persistent, the client will re-
establish communication using the existing session after the reboot. establish communication using the existing session after the reboot.
This session will be associated with a stale clientid and the client This session will be associated with a stale clientid and the client
will receive an indication of that fact in the sr_status field will receive an indication of that fact in the sr_status field
returned by the SEQUENCE operation (see Section 2.9.2.1). The client returned by the SEQUENCE operation (see Section 2.9.2.1). The client
can then use the existing session to do whatever operations are can then use the existing session to do whatever operations are
necessary to determine the status of requests outstanding at the time necessary to determine the status of requests outstanding at the time
of reboot, while avoiding issuing new requests, particularly any of reboot, while avoiding issuing new requests, particularly any
involving locking on that session. Such requests would fail with involving locking on that session. Such requests would fail with
NFS4ERR_STALE_CLIENTID error or an NFS4ERR_STALE_STATEID error, if NFS4ERR_STALE_CLIENTID error or an NFS4ERR_STALE_STATEID error, if
attempted. In any case, the client would create a new clientid using attempted. In any case, the client would create a new clientid using
CREATE_CLIENTID, create a new session based on that clientid, and EXCHANGE_ID, create a new session based on that clientid, and proceed
proceed to other necessary recovery for the server reboot case. to other necessary recovery for the server reboot case.
See the detailed descriptions of CREATE_CLIENTID (Section 16.35) and See the detailed descriptions of EXCHANGE_ID (Section 16.35) and
CREATE_SESSION (Section 16.36) for a complete specification of these CREATE_SESSION (Section 16.36) for a complete specification of these
operations. operations.
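The non-persistent-session recovery path described above can be sketched with a toy in-memory server. Everything here (FakeServer, StaleClientid, establish_session) is hypothetical scaffolding rather than an NFSv4.1 API; it only shows the order of steps: try CREATE_SESSION with the existing clientid, and on a stale clientid fall back to EXCHANGE_ID followed by CREATE_SESSION.

```python
class StaleClientid(Exception):
    """Stands in for the NFS4ERR_STALE_CLIENTID error."""


class FakeServer:
    """Toy server that forgets clientids on reboot but never reuses them."""

    def __init__(self):
        self._next_clientid = 1
        self._known = set()

    def reboot(self):
        self._known.clear()  # clientids issued before the reboot become stale

    def exchange_id(self, co_ownerid):
        clientid = self._next_clientid
        self._next_clientid += 1  # no conflict with earlier clientids
        self._known.add(clientid)
        return clientid

    def create_session(self, clientid):
        if clientid not in self._known:
            raise StaleClientid(clientid)
        return ("session", clientid)


def establish_session(server, co_ownerid, clientid=None):
    """Return (clientid, session), obtaining a new clientid if needed."""
    if clientid is not None:
        try:
            return clientid, server.create_session(clientid)
        except StaleClientid:
            pass  # server rebooted: fall through and get a new clientid
    clientid = server.exchange_id(co_ownerid)
    return clientid, server.create_session(clientid)
```

After establish_session returns with a new clientid, the client would still need to carry out the remaining server-reboot recovery, which this sketch does not model.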
2.4.1. Server Release of Clientid 2.4.1. Server Release of Clientid
If the server determines that the client holds no associated state If the server determines that the client holds no associated state
for its clientid, the server may choose to release the clientid. The for its clientid, the server may choose to release the clientid. The
server may make this choice for an inactive client so that resources server may make this choice for an inactive client so that resources
are not consumed by those intermittently active clients. If the are not consumed by those intermittently active clients. If the
client contacts the server after this release, the server must ensure client contacts the server after this release, the server must ensure
the client receives the appropriate error so that it will use the the client receives the appropriate error so that it will use the
CREATE_CLIENTID/CREATE_SESSION sequence to establish a new identity. EXCHANGE_ID/CREATE_SESSION sequence to establish a new identity. It
It should be clear that the server must be very hesitant to release a should be clear that the server must be very hesitant to release a
clientid since the resulting work on the client to recover from such clientid since the resulting work on the client to recover from such
an event will be the same burden as if the server had failed and an event will be the same burden as if the server had failed and
restarted. Typically a server would not release a clientid unless restarted. Typically a server would not release a clientid unless
there had been no activity from that client for many minutes. Note there had been no activity from that client for many minutes. Note
that "associated state" includes sessions. As long as there are that "associated state" includes sessions. As long as there are
sessions, the server MUST NOT release the clientid. See sessions, the server MUST NOT release the clientid. See
Section 2.9.8.1.4 for discussion on releasing inactive sessions. Section 2.9.8.1.4 for discussion on releasing inactive sessions.
Note that if the id string in a CREATE_CLIENTID request is properly Note that if the id string in an EXCHANGE_ID request is properly
constructed, and if the client takes care to use the same principal constructed, and if the client takes care to use the same principal
for each successive use of CREATE_CLIENTID, then, barring an active for each successive use of EXCHANGE_ID, then, barring an active
denial of service attack, NFS4ERR_CLID_INUSE should never be denial of service attack, NFS4ERR_CLID_INUSE should never be
returned. returned.
However, client bugs, server bugs, or perhaps a deliberate change of However, client bugs, server bugs, or perhaps a deliberate change of
the principal owner of the id string (such as the case of a client the principal owner of the id string (such as the case of a client
that changes security flavors, and under the new flavor, there is no that changes security flavors, and under the new flavor, there is no
mapping to the previous owner) will in rare cases result in mapping to the previous owner) will in rare cases result in
NFS4ERR_CLID_INUSE. NFS4ERR_CLID_INUSE.
In that event, when the server gets a CREATE_CLIENTID for a client id In that event, when the server gets an EXCHANGE_ID for a client id
that currently has no state, or it has state, but the lease has that currently has no state, or it has state, but the lease has
expired, rather than returning NFS4ERR_CLID_INUSE, the server MUST expired, rather than returning NFS4ERR_CLID_INUSE, the server MUST
allow the CREATE_CLIENTID, and confirm the new clientid if followed allow the EXCHANGE_ID, and confirm the new clientid if followed by
by the appropriate CREATE_SESSION. the appropriate CREATE_SESSION.
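The NFS4ERR_CLID_INUSE rule above can be sketched as a single decision function. The record layout (a dict with 'principal', 'has_state', and 'lease_expiry' keys) and the string results are assumptions chosen for illustration; a real server keeps this information in its own state tables.

```python
def exchange_id_decision(record, principal, now):
    """Decide how to answer an EXCHANGE_ID for an id string.

    record    -- None if the id string is unknown, else a dict with the
                 hypothetical keys 'principal', 'has_state', 'lease_expiry'
    principal -- principal issuing this EXCHANGE_ID
    now       -- current time, in the same units as 'lease_expiry'
    """
    if record is None or record["principal"] == principal:
        return "ok"
    # Different principal: refuse only while live, unexpired state exists.
    if not record["has_state"] or record["lease_expiry"] <= now:
        return "ok"
    return "NFS4ERR_CLID_INUSE"
```

The error is returned only when a different principal presents an id string that still has unexpired state, which is also the case where the server must not cancel that state.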
2.5. Security Service Negotiation 2.5. Security Service Negotiation
With the NFS version 4 server potentially offering multiple security With the NFS version 4 server potentially offering multiple security
mechanisms, the client needs a method to determine or negotiate which mechanisms, the client needs a method to determine or negotiate which
mechanism is to be used for its communication with the server. The mechanism is to be used for its communication with the server. The
NFS server may have multiple points within its file system namespace NFS server may have multiple points within its file system namespace
that are available for use by NFS clients. These points can be that are available for use by NFS clients. These points can be
considered security policy boundaries, and in some NFS considered security policy boundaries, and in some NFS
implementations are tied to NFS export points. In turn the NFS implementations are tied to NFS export points. In turn the NFS
server may be configured such that each of these security policy server may be configured such that each of these security policy
boundaries may have different or multiple security mechanisms in use. boundaries may have different or multiple security mechanisms in use.
The security negotiation between client and server must be done with The security negotiation between client and server must be done with
a secure channel to eliminate the possibility of a third party a secure channel to eliminate the possibility of a third party
intercepting the negotiation sequence and forcing the client and intercepting the negotiation sequence and forcing the client and
server to choose a lower level of security than required or desired. server to choose a lower level of security than required or desired.
See section Section 19 for further discussion. See Section 19 for further discussion.
2.5.1. NFSv4 Security Tuples 2.5.1. NFSv4 Security Tuples
An NFS server can assign one or more "security tuples" to each An NFS server can assign one or more "security tuples" to each
security policy boundary in its namespace. Each security tuple security policy boundary in its namespace. Each security tuple
consists of a security flavor (see Section 2.2.1.1), and if the consists of a security flavor (see Section 2.2.1.1), and if the
flavor is RPCSEC_GSS, a GSS-API mechanism OID, a GSS-API quality of flavor is RPCSEC_GSS, a GSS-API mechanism OID, a GSS-API quality of
protection, and an RPCSEC_GSS service. protection, and an RPCSEC_GSS service.
2.5.2. SECINFO and SECINFO_NO_NAME 2.5.2. SECINFO and SECINFO_NO_NAME
The SECINFO and SECINFO_NO_NAME operations allow the client to The SECINFO and SECINFO_NO_NAME operations allow the client to
determine, on a per filehandle basis, what security tuple is to be determine, on a per filehandle basis, what security tuple is to be
used for server access. In general, the client will not have to use used for server access. In general, the client will not have to use
either operation except during initial communication with the server either operation except during initial communication with the server
or when the client crosses security policy boundaries at the server. or when the client crosses security policy boundaries at the server.
It is possible that the server's policies change during the client's It is possible that the server's policies change during the client's
interaction therefore forcing the client to negotiate a new security interaction therefore forcing the client to negotiate a new security
tuple. tuple.
Where the use of different security tuples would affect the type of
access that would be allowed if a request was issued over the same
connection used for the SECINFO or SECINFO_NO_NAME operation (e.g.
read-only vs. read-write), security tuples that allow greater
access should be presented first. Where the general level of access
is the same and different security flavors limit the range of
principals whose privileges are recognized (e.g. allowing or
disallowing root access), flavors supporting the greatest range of
principals should be listed first.
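The ordering guidance above amounts to a two-level sort, sketched below. The numeric 'access' and 'principal_range' ranks are hypothetical; the draft gives only the relative ordering, not a scale.

```python
def order_security_tuples(tuples):
    """Order security tuples for a SECINFO or SECINFO_NO_NAME reply.

    Each tuple is a dict with hypothetical numeric ranks:
      'access'          -- higher means greater access (read-write > read-only)
      'principal_range' -- higher means more principals are recognized
    Greater access sorts first; ties are broken by wider principal range.
    """
    return sorted(tuples, key=lambda t: (-t["access"], -t["principal_range"]))
```

Because sorted() is stable, tuples with equal ranks keep the order in which the server listed them.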
2.5.3. Security Error 2.5.3. Security Error
Based on the assumption that each NFS version 4 client and server Based on the assumption that each NFS version 4 client and server
must support a minimum set of security (i.e., LIPKEY, SPKM-3, and must support a minimum set of security (i.e., LIPKEY, SPKM-3, and
Kerberos-V5 all under RPCSEC_GSS), the NFS client will initiate file Kerberos-V5 all under RPCSEC_GSS), the NFS client will initiate file
access to the server with one of the minimal security tuples. During access to the server with one of the minimal security tuples. During
communication with the server, the client may receive an NFS error of communication with the server, the client may receive an NFS error of
NFS4ERR_WRONGSEC. This error allows the server to notify the client NFS4ERR_WRONGSEC. This error allows the server to notify the client
that the security tuple currently being used contravenes the that the security tuple currently being used contravenes the
server's security policy. The client is then responsible for server's security policy. The client is then responsible for
skipping to change at page 36, line 35 skipping to change at page 36, line 36
2.9.2.2. Clientid and Session Association 2.9.2.2. Clientid and Session Association
Sessions are subordinate to the clientid (Section 2.4). Each Sessions are subordinate to the clientid (Section 2.4). Each
clientid can have zero or more active sessions. A clientid, and a clientid can have zero or more active sessions. A clientid, and a
session bound to it are required to do anything useful in NFSv4.1. session bound to it are required to do anything useful in NFSv4.1.
Each time a session is used, the state leased to its associated Each time a session is used, the state leased to its associated
clientid is automatically renewed. clientid is automatically renewed.
State such as share reservations, locks, delegations, and layouts State such as share reservations, locks, delegations, and layouts
(Section 1.5.4) is tied to the clientid, not the sessions of the (Section 1.4.4) is tied to the clientid, not the sessions of the
clientid. Successive state changing operations from a given state clientid. Successive state changing operations from a given state
owner can go over different sessions, as long as each session is owner can go over different sessions, as long as each session is
associated with the same clientid. Callbacks can arrive over a associated with the same clientid. Callbacks can arrive over a
different session than the session that sent the operation that different session than the session that sent the operation that
acquired the state that the callback is for. For example, if session acquired the state that the callback is for. For example, if session
A is used to acquire a delegation, a request to recall the delegation A is used to acquire a delegation, a request to recall the delegation
can arrive over session B. can arrive over session B.
2.9.3. Channels 2.9.3. Channels
skipping to change at page 37, line 28 skipping to change at page 37, line 29
Because there are at most two channels per session, and because each Because there are at most two channels per session, and because each
channel has a distinct purpose, channels are not assigned channel has a distinct purpose, channels are not assigned
identifiers. The operation and backchannel are implicitly created identifiers. The operation and backchannel are implicitly created
and associated when the session is created. and associated when the session is created.
2.9.3.4. Connection and Channel Association 2.9.3.4. Connection and Channel Association
Each channel is associated with zero or more transport connections. Each channel is associated with zero or more transport connections.
A connection can be bound to one channel or both channels of a A connection can be bound to one channel or both channels of a
session; the client and server negotiate whether a connection will session; the client and server negotiate whether a connection will
carry traffic for one channel or both channel via the the carry traffic for one channel or both channels via the CREATE_SESSION
CREATE_SESSION (Section 16.36) and the BIND_CONN_TO_SESSION (Section 16.36) and the BIND_CONN_TO_SESSION (Section 16.34)
(Section 16.34) operations. When a session is created via operations. When a session is created via CREATE_SESSION, it is
CREATE_SESSION, it is automatically bound to the operation channel, automatically bound to the operation channel, and optionally the
and optionally the backchannel. If the client does not specify backchannel. If the client does not specify connection binding
connection binding enforcement when the session is created, then enforcement when the session is created, then additional connections
additional connections are automatically bound to the operation are automatically bound to the operation channel when they are used
channel when they are used with a SEQUENCE operation that has the with a SEQUENCE operation that has the session's sessionid.
session's sessionid.
A connection MAY be bound to the channels of other sessions. The A connection MAY be bound to the channels of other sessions. The
client decides, and the NFSv4.1 server MUST allow it. A connection client decides, and the NFSv4.1 server MUST allow it. A connection
MAY be bound to the the channels' of other sessions of other MAY be bound to the channels of other sessions of other clientids.
clientids. Again, the client decides, and the server MUST allow it. Again, the client decides, and the server MUST allow it.
It is permissible for connections of multiple types to be bound to It is permissible for connections of multiple types to be bound to
the same channel. For example a TCP and RDMA connection can be bound the same channel. For example a TCP and RDMA connection can be bound
to the operation channel. In the event an RDAM and non-RDMA to the operation channel. In the event an RDMA and non-RDMA
connection are bound to the same channel, the maximum number of slots connection are bound to the same channel, the maximum number of slots
must be at least one more than the total number of credits. This way must be at least one more than the total number of credits. This way
if all RDMA credits are used, the non-RDMA connection can have at if all RDMA credits are used, the non-RDMA connection can have at
least one outstanding request. least one outstanding request.
It is permissible for a connection of one type to be bound to the It is permissible for a connection of one type to be bound to the
operation channel, and another type bound to the backchannel. operation channel, and another type bound to the backchannel.
2.9.3.4.1. Trunking 2.9.3.4.1. Trunking
Since multiple connections can be bound to a session's channel, these The eir_server_owner results from EXCHANGE_ID give a client a hint
means that traffic between an NFSv4.1 client and server channel goes that the server it is connected to may be the same as the server it
over all connections. If the connections are over different network is connected to via another connection. When two connections have
paths, this is trunking. NFSv4.1 allows trunking, thus allows the the same eir_server_owner.so_major_id, the client treats the
bandwidth capacity to scale with the number of connections. connections as connected to the same server (even if the destination
network addresses are different) and uses a common clientid to
identify itself. The eir_server_owner.so_minor_id field allows the
server to control binding of connections to sessions. When two
connections have a matching so_major_id and so_minor_id, the client
may bind both connections to a common session; this is session
trunking. When two connections have a matching so_major_id, but
different so_minor_id, the client will need to create a new session
for the clientid in order to use the connection; this is clientid
trunking. In either session or clientid trunking, the bandwidth
capacity can scale with the number of connections.
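The comparison rules above can be sketched as a small classifier. This is a minimal illustration, not part of the protocol definition: the `ServerOwner` class, the `Trunking` enum, and `classify_trunking` are hypothetical names, and the field types are assumptions. It only classifies what the servers claim; it does not verify those claims.

```python
from dataclasses import dataclass
from enum import Enum


class Trunking(Enum):
    SESSION = "session trunking"
    CLIENTID = "clientid trunking"
    NONE = "no trunking"


@dataclass(frozen=True)
class ServerOwner:
    # Field names follow eir_server_owner; the Python types are assumptions.
    so_major_id: bytes
    so_minor_id: int


def classify_trunking(a: ServerOwner, b: ServerOwner) -> Trunking:
    """Classify two connections by the server_owner values each returned."""
    if a.so_major_id != b.so_major_id:
        # Different major id: the client does not treat these as one server.
        return Trunking.NONE
    if a.so_minor_id == b.so_minor_id:
        # Both ids match: both connections may be bound to a common session.
        return Trunking.SESSION
    # Same server, different minor id: a new session is needed for the clientid.
    return Trunking.CLIENTID
```

For example, two connections reporting the same so_major_id but different so_minor_id values classify as clientid trunking.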
At issue is how do NFSv4.1 clients and servers discover and verify Just because two servers over two connections claim matching or
multiple paths? On the client side, each client should be aware of partially matching server_owner4 values does not mean the client should or
the network interfaces it has available from which to create must trust the servers' claims. The client may verify these claims
connections. However, the client cannot always be certain whether a before trunking traffic.
server's multitude of network interfaces in fact belong to the same
server, or even if they do, whether the server is prepared to share a
clientid or sessionid across all its interfaces. NFSv4.1 provides no
discovery protocol for allowing servers to advertise multiple network
interfaces; such a protocol is problematic because network address
translation (NAT) may be occurring between the client and server, and
so, unless the NAT devices are inspecting NFSv4.1 traffic, the
network addresses the server offers to the client would be
meaningless. At best, short of manual configuration, an NFSv4.1
client could use a host name to network address directory (e.g. DNS)
to enumerate a server's network interfaces. This then leaves the
problem of verification.
NFSv4.1 provides a way for clients and servers to reliably verify if For session trunking, clients and servers can reliably verify if
connections between different network paths are in fact bound to the connections between different network paths are in fact bound to the
same NFSv4.1 server. The SET_SSV (Section 16.47) operation allows a same NFSv4.1 server and usable on the same session. The SET_SSV
client and server to establish a unique, shared key value (the SSV). (Section 16.47) operation allows a client and server to establish a
When a new connection is bound to the session (via the unique, shared key value (the SSV). When a new connection is bound
BIND_CONN_TO_SESSION operation, see Section 16.34), the client must to the session (via the BIND_CONN_TO_SESSION operation, see
offer a digest that is based on the SSV. If the client mistakenly tries Section 16.34), the client offers a digest that is based on the SSV. If
to bind a connection to a session of a wrong server, the server will the client mistakenly tries to bind a connection to a session of a
either reject the attempt because it is not aware of the session wrong server, the server will either reject the attempt because it is
identifier of the BIND_CONN_TO_SESSION arguments, or it will reject not aware of the session identifier of the BIND_CONN_TO_SESSION
the attempt because the digest for the SSV does not match what the arguments, or it will reject the attempt because the digest for the
server expects. Even if the server mistakenly or maliciously accept SSV does not match what the server expects. Even if the server
the connection bind attempt, the digest it computes in the response mistakenly or maliciously accepts the connection bind attempt, the
will not be verified by the client, so the client will know it cannot digest it computes in the response will not be verified by the
use the connection for trunking the specified channel. client, so the client will know it cannot use the connection for
trunking the specified channel.
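The two rejection paths described above (unknown session identifier, or a digest that does not match) can be sketched as follows. This is an illustration under stated assumptions: the choice of HMAC-SHA256, the digest input (session identifier plus a nonce), and the names `ssv_digest` and `server_accepts_bind` are all hypothetical and are not taken from the BIND_CONN_TO_SESSION definition in Section 16.34.

```python
import hashlib
import hmac


def ssv_digest(ssv: bytes, sessionid: bytes, nonce: bytes) -> bytes:
    # Assumption: a keyed digest over the session id and a nonce.
    return hmac.new(ssv, sessionid + nonce, hashlib.sha256).digest()


def server_accepts_bind(known_sessions: dict, sessionid: bytes,
                        nonce: bytes, offered_digest: bytes) -> bool:
    """Return True only if the session is known and the digest matches."""
    ssv = known_sessions.get(sessionid)
    if ssv is None:
        # Wrong server: it is not aware of this session identifier.
        return False
    expected = ssv_digest(ssv, sessionid, nonce)
    # Constant-time comparison of the offered and expected digests.
    return hmac.compare_digest(expected, offered_digest)
```

A client would apply the same comparison to the digest in the server's response before using the connection for trunking.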
In the case of clientid trunking, the client can use RPCSEC_GSS to
verify that each connection is aimed at the same server. When the
client invokes EXCHANGE_ID, it should use RPCSEC_GSS. If each
RPCSEC_GSS context over each connection has the same server
principal, then the servers at the end of each connection are the
same.
2.9.4. Exactly Once Semantics 2.9.4. Exactly Once Semantics
Via the session, NFSv4.1 offers exactly once semantics (EOS) for Via the session, NFSv4.1 offers exactly once semantics (EOS) for
requests sent over a channel. EOS is supported on both the operation requests sent over a channel. EOS is supported on both the operation
and back channels. and back channels.
Each COMPOUND or CB_COMPOUND request that is issued with a leading
SEQUENCE or CB_SEQUENCE operation MUST be executed by the receiver
exactly once. This requirement applies regardless of whether the request is
issued with reply caching specified (see Section 2.9.4.1.2). The
requirement holds even if the requester is issuing the request over a
session created between a pNFS data client and pNFS data server. The
rationale for this requirement is understood by categorizing requests
into three classifications:
o Nonidempotent requests.
o Idempotent modifying requests.
o Idempotent non-modifying requests.
An example of a non-idempotent request is RENAME. It is obvious that
if a replier executes the same RENAME request twice, and the first
execution succeeds, the re-execution will fail. If the replier
returns the result from the re-execution, this result is incorrect.
Therefore, EOS is required for nonidempotent requests.
An example of an idempotent modifying request is a COMPOUND request
containing a WRITE operation. Repeated execution of the same WRITE
has the same effect as execution of that write once. Nevertheless,
enforcing EOS for WRITEs and other idempotent modifying
requests is necessary to avoid data corruption.
Suppose a client issues WRITEs A, B, C to a noncompliant server that
does not enforce EOS, and receives no response, perhaps due to a
network partition. The client reconnects to the server and re-issues
all three WRITEs. Now, the server has outstanding two instances of
each of A, B, and C. The server can be in a situation in which it
executes and replies to the retries of A, B, and C while the first A,
B, and C are still waiting around in the server's I/O system for some
resource. Upon receiving the replies to the second attempts of
WRITEs A, B, and C, the client believes its writes are done so it is
free to issue WRITE D, which overlaps the range of one or more of
A, B, and C. If any of A, B, or C is subsequently executed for the
second time, then what has been written by D can be overwritten and
thus corrupted.
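The scenario above can be replayed against an in-memory byte range. This is a toy model only: the offsets, payloads, and the function name `replay_without_eos` are invented for illustration, and the "server" is just a bytearray with no notion of sessions.

```python
def apply_write(store: bytearray, offset: int, data: bytes) -> None:
    """Overwrite store[offset:offset+len(data)] in place."""
    store[offset:offset + len(data)] = data


def replay_without_eos() -> tuple:
    """Return (expected, actual) file contents when a stale WRITE runs late."""
    a, b, c = (0, b"AA"), (2, b"BB"), (4, b"CC")
    d = (1, b"DDD")  # overlaps the ranges of A and B

    store = bytearray(b"." * 8)
    # The retries of A, B, and C execute and are acknowledged.
    for offset, data in (a, b, c):
        apply_write(store, offset, data)
    # The client, believing its writes are done, issues D.
    apply_write(store, *d)
    expected = bytes(store)
    # The first instance of A finally executes, clobbering part of D.
    apply_write(store, *a)
    return expected, bytes(store)
```

In this model the expected contents are `ADDDCC..`, but the late execution of the original A leaves `AADDCC..`.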
Note that the server is not required to cache the reply to the
modifying operation to avoid data corruption (but if the client
specified the reply to be cached, the server must cache it).
An example of an idempotent non-modifying request is a COMPOUND
containing SEQUENCE, PUTFH, READLINK and nothing else. The
re-execution of such a request will not cause data corruption or
produce an incorrect result. Nonetheless, for simplicity, the
replier MUST enforce EOS for such requests.
2.9.4.1. Slot Identifiers and Reply Cache 2.9.4.1. Slot Identifiers and Reply Cache
The RPC layer provides a transaction ID (xid), which, while required The RPC layer provides a transaction ID (xid), which, while required
to be unique, is not especially convenient for tracking requests. to be unique, is not especially convenient for tracking requests.
The xid is only meaningful to the requester; it cannot be interpreted The xid is only meaningful to the requester; it cannot be interpreted
at the replier except to test for equality with previously issued at the replier except to test for equality with previously issued
requests. Because RPC operations may be completed by the replier in requests. Because RPC operations may be completed by the replier in
any order, many transaction IDs may be outstanding at any time. The any order, many transaction IDs may be outstanding at any time. The
requester may therefore perform a computationally expensive lookup requester may therefore perform a computationally expensive lookup
operation in the process of demultiplexing each reply. operation in the process of demultiplexing each reply.
skipping to change at page 44, line 36 skipping to change at page 45, line 43
among them pnfs layout recalls, described in Section 12.3.5.3 among them pnfs layout recalls, described in Section 12.3.5.3
[[Comment.8: fill in the blanks w/others, etc...]] [[Comment.8: fill in the blanks w/others, etc...]]
2.9.4.4. COMPOUND and CB_COMPOUND Construction Issues 2.9.4.4. COMPOUND and CB_COMPOUND Construction Issues
Very large requests and replies may pose both buffer management Very large requests and replies may pose both buffer management
issues (especially with RDMA) and reply cache issues. When the issues (especially with RDMA) and reply cache issues. When the
session is created (Section 16.36), the client and server negotiate session is created (Section 16.36), the client and server negotiate
the maximum sized request they will send or process the maximum sized request they will send or process
(ca_maxrequestsize), the maximum sized reply they will return or (ca_maxrequestsize), the maximum sized reply they will return or
process (ca_maxresponsesize), and the the maximum sized reply they process (ca_maxresponsesize), and the maximum sized reply they will
will store in the reply cache (ca_maxresponsesize_cached). store in the reply cache (ca_maxresponsesize_cached).
If a request exceeds ca_maxrequestsize, the reply will have the If a request exceeds ca_maxrequestsize, the reply will have the
status NFS4ERR_REQ_TOO_BIG. A replier may return NFS4ERR_REQ_TOO_BIG status NFS4ERR_REQ_TOO_BIG. A replier may return NFS4ERR_REQ_TOO_BIG
as the status for the first operation (SEQUENCE or CB_SEQUENCE) in the as the status for the first operation (SEQUENCE or CB_SEQUENCE) in the
request, or it may choose to return it on a subsequent operation. request, or it may choose to return it on a subsequent operation.
If a reply exceeds ca_maxresponsesize, the reply will have the status If a reply exceeds ca_maxresponsesize, the reply will have the status
NFS4ERR_REP_TOO_BIG. A replier may return NFS4ERR_REP_TOO_BIG as the NFS4ERR_REP_TOO_BIG. A replier may return NFS4ERR_REP_TOO_BIG as the
status for the first operation (SEQUENCE or CB_SEQUENCE) in the request, status for the first operation (SEQUENCE or CB_SEQUENCE) in the request,
or it may choose to return it on a subsequent operation. or it may choose to return it on a subsequent operation.
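The size checks can be sketched as a single status function. This is a simplified illustration: the function name `size_status` and the use of plain integers for sizes are assumptions, the rule shown for NFS4ERR_REP_TOO_BIG_TO_CACHE (reply larger than ca_maxresponsesize_cached while caching was requested) is inferred from the field's description rather than quoted, and the sketch does not model which operation in the COMPOUND carries the error.

```python
def size_status(request_size: int, reply_size: int, cachethis: bool,
                ca_maxrequestsize: int, ca_maxresponsesize: int,
                ca_maxresponsesize_cached: int) -> str:
    """Return the status name a replier might choose for the given sizes."""
    if request_size > ca_maxrequestsize:
        return "NFS4ERR_REQ_TOO_BIG"
    if reply_size > ca_maxresponsesize:
        return "NFS4ERR_REP_TOO_BIG"
    # Assumption: this check applies only when the requester asked for caching.
    if cachethis and reply_size > ca_maxresponsesize_cached:
        return "NFS4ERR_REP_TOO_BIG_TO_CACHE"
    return "NFS4_OK"
```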
skipping to change at page 45, line 17 skipping to change at page 46, line 24
is returned on an operation other than the first operation (SEQUENCE or is returned on an operation other than the first operation (SEQUENCE or
CB_SEQUENCE), then the reply MUST be cached if sa_cachethis or CB_SEQUENCE), then the reply MUST be cached if sa_cachethis or
csa_cachethis are TRUE. For example, if a COMPOUND has eleven csa_cachethis are TRUE. For example, if a COMPOUND has eleven
operations, including SEQUENCE, the fifth operation is a RENAME, and operations, including SEQUENCE, the fifth operation is a RENAME, and
the tenth operation is a READ for one million bytes, the server may the tenth operation is a READ for one million bytes, the server may
return NFS4ERR_REP_TOO_BIG_TO_CACHE on the tenth operation. Since return NFS4ERR_REP_TOO_BIG_TO_CACHE on the tenth operation. Since
the server executed several operations, especially the non-idempotent the server executed several operations, especially the non-idempotent
RENAME, the client's request to cache the reply needs to be honored RENAME, the client's request to cache the reply needs to be honored
in order for correct operation of exactly once semantics. If the in order for correct operation of exactly once semantics. If the
client retries the request, the server will have cached a reply that client retries the request, the server will have cached a reply that
contains results for ten of the elevent requested operations, with contains results for ten of the eleven requested operations, with the
the tenth operation having a status of NFS4ERR_REP_TOO_BIG_TO_CACHE. tenth operation having a status of NFS4ERR_REP_TOO_BIG_TO_CACHE.
A client needs to take care that when sending operations that change A client needs to take care that when sending operations that change
the current filehandle (except for PUTFH, PUTPUBFH, and PUTROOTFH) the current filehandle (except for PUTFH, PUTPUBFH, and PUTROOTFH)
that it not exceed the maximum reply buffer before the GETFH that it not exceed the maximum reply buffer before the GETFH
operation. Otherwise the client will have to retry the operation operation. Otherwise the client will have to retry the operation
that changed the current filehandle, in order to obtain the desired that changed the current filehandle, in order to obtain the desired
filehandle. For the OPEN operation (see Section 16.16), retry is not filehandle. For the OPEN operation (see Section 16.16), retry is not
always available as an option. The following guidelines for the always available as an option. The following guidelines for the
handling of filehandle changing operations are advised: handling of filehandle changing operations are advised:
skipping to change at page 45, line 48 skipping to change at page 47, line 8
o A server SHOULD return NFS4ERR_REP_TOO_BIG or o A server SHOULD return NFS4ERR_REP_TOO_BIG or
NFS4ERR_REP_TOO_BIG_TO_CACHE (if sa_cachethis is TRUE) on a NFS4ERR_REP_TOO_BIG_TO_CACHE (if sa_cachethis is TRUE) on a
filehandle changing non-idempotent operation if the reply would be filehandle changing non-idempotent operation if the reply would be
too large on the next operation, especially if the operation is too large on the next operation, especially if the operation is
OPEN. OPEN.
o A server MAY return NFS4ERR_UNSAFE_COMPOUND if it looks at the o A server MAY return NFS4ERR_UNSAFE_COMPOUND if it looks at the
next operation after a non-idempotent current filehandle changing next operation after a non-idempotent current filehandle changing
operation, and finds it is not GETFH. The server would do this if operation, and finds it is not GETFH. The server would do this if
it it unable to determine in advance whether the total response it is unable to determine in advance whether the total response
size would exceed ca_maxresponsesize_cached or ca_maxresponsesize. size would exceed ca_maxresponsesize_cached or ca_maxresponsesize.
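The look-ahead in the last guideline can be sketched as a scan over operation names. This is a hedged illustration: `check_compound` and the set `FH_CHANGING_NONIDEMPOTENT` are hypothetical, the set lists only OPEN because that is the one operation the text names, and treating a filehandle-changing operation at the end of the COMPOUND as safe is a choice made for this sketch.

```python
# Assumption: only OPEN is listed; a real server would use its own full set.
FH_CHANGING_NONIDEMPOTENT = frozenset({"OPEN"})


def check_compound(ops: list) -> str:
    """Flag a COMPOUND whose filehandle-changing op is not followed by GETFH."""
    for i, op in enumerate(ops):
        if op not in FH_CHANGING_NONIDEMPOTENT:
            continue
        if i + 1 == len(ops):
            # Nothing follows, so there is no next operation to inspect.
            continue
        if ops[i + 1] != "GETFH":
            return "NFS4ERR_UNSAFE_COMPOUND"
    return "NFS4_OK"
```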
2.9.4.5. Persistence 2.9.4.5. Persistence
Since the reply cache is bounded, it is practical for the server Since the reply cache is bounded, it is practical for the server
reply cache to persist across server reboots, and to be kept in reply cache to persist across server reboots, and to be kept in
stable storage (a client's reply cache for callbacks need not persist stable storage (a client's reply cache for callbacks need not persist
across client reboots unless the client intends for its session and across client reboots unless the client intends for its session and
other state to persist across reboots). other state to persist across reboots).
o The slot table including the sequenceid and cached reply for each o The slot table including the sequenceid and cached reply for each
slot. slot.
o The sessionid. o The sessionid.
o The clientid. o The clientid.
o The SSV (see section Section 2.9.6.3). o The SSV (see Section 2.9.6.3).
The CREATE_SESSION (see Section 16.36) operation determines the The CREATE_SESSION (see Section 16.36) operation determines the
persistence of the reply cache. persistence of the reply cache.
2.9.5. RDMA Considerations 2.9.5. RDMA Considerations
A complete discussion of the operation of RPC-based protocols atop A complete discussion of the operation of RPC-based protocols atop
RDMA transports is in [RPCRDMA]. A discussion of the operation of RDMA transports is in [RPCRDMA]. A discussion of the operation of
NFSv4, including NFSv4.1 over RDMA is in [NFSDDP]. Where RDMA is NFSv4, including NFSv4.1 over RDMA is in [NFSDDP]. Where RDMA is
considered, this specification assumes the use of such a layering; it considered, this specification assumes the use of such a layering; it
skipping to change at page 53, line 36 skipping to change at page 54, line 43
that requires a session, but nonetheless is not sending operations that requires a session, but nonetheless is not sending operations
risks having the session be destroyed by the server. This is risks having the session be destroyed by the server. This is
because sessions consume resources, and resource limitations may because sessions consume resources, and resource limitations may
force the server to cull the least recently used session. force the server to cull the least recently used session.
o Destroy the session when idle. When a session has no state other o Destroy the session when idle. When a session has no state other
than the session, and no outstanding requests, the client should than the session, and no outstanding requests, the client should
consider destroying the session. consider destroying the session.
o Maintain GSS contexts for callback. If the client requires the o Maintain GSS contexts for callback. If the client requires the
server to to use the RPCSEC_GSS security flavor for callbacks, server to use the RPCSEC_GSS security flavor for callbacks, then
then it needs to be sure the contexts handed to the server via it needs to be sure the contexts handed to the server via
BACKCHANNEL_CTL are unexpired. A good practice is to keep at BACKCHANNEL_CTL are unexpired. A good practice is to keep at
least two contexts outstanding, where the expiration time of the least two contexts outstanding, where the expiration time of the
newest context at the time it was created, is N times that of the newest context at the time it was created, is N times that of the
oldest context, where N is the number of contexts available for oldest context, where N is the number of contexts available for
callbacks. callbacks.
o Maintain an active connection. The server requires a callback o Maintain an active connection. The server requires a callback
path in order to gracefully recall recallable state, or notify the path in order to gracefully recall recallable state, or notify the
client of certain events. client of certain events.
2.9.7.3. Steps the Client Takes To Establish a Session 2.9.7.3. Steps the Client Takes To Establish a Session
The client issues CREATE_CLIENTID to establish a clientid. The client issues EXCHANGE_ID to establish a clientid.
The client uses the clientid to issue a CREATE_SESSION on a The client uses the clientid to issue a CREATE_SESSION on a
connection to the server. The results of CREATE_SESSION indicate connection to the server. The results of CREATE_SESSION indicate
whether the server will persist the session replay cache through a whether the server will persist the session replay cache through a
server reboot or not, and the client notes this for future reference. server reboot or not, and the client notes this for future reference.
The client SHOULD have specified connection binding enforcement when The client SHOULD have specified connection binding enforcement when
the session was created. If so, the client SHOULD issue SET_SSV in the session was created. If so, the client SHOULD issue SET_SSV in
the first COMPOUND after the session is created. If it is not using the first COMPOUND after the session is created. If it is not using
machine credentials, then each time a new principal goes to use the machine credentials, then each time a new principal goes to use the
skipping to change at page 55, line 19 skipping to change at page 56, line 26
Section 2.9.4.2. Note that it is not necessary to retry requests Section 2.9.4.2. Note that it is not necessary to retry requests
over a connection with the same source network address or the same over a connection with the same source network address or the same
destination network address as the disconnected connection. As long destination network address as the disconnected connection. As long
as the sessionid, slotid, and sequenceid in the retry match that of as the sessionid, slotid, and sequenceid in the retry match that of
the original request, the server will recognize the request as a the original request, the server will recognize the request as a
retry if it did see the request prior to disconnect. retry if it did see the request prior to disconnect.
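Retry recognition keyed on sessionid, slotid, and sequenceid can be sketched as a small slot table. This is a minimal model, not the full slot rules of Section 2.9.4.1: the class name `SlotTable`, the `handle` method, and the representation of a request as a callable are assumptions, and no validation of out-of-order sequenceids is attempted.

```python
class SlotTable:
    """Remembers the last sequenceid and reply seen on each slot."""

    def __init__(self):
        # (sessionid, slotid) -> (sequenceid, cached reply)
        self._slots = {}

    def handle(self, sessionid, slotid, sequenceid, execute):
        """Run execute() once per (session, slot, sequence); replay on retry."""
        key = (sessionid, slotid)
        last = self._slots.get(key)
        if last is not None and last[0] == sequenceid:
            # Same identifiers as the original request: treat it as a retry,
            # whatever connection or network address it arrived on.
            return last[1]
        reply = execute()
        self._slots[key] = (sequenceid, reply)
        return reply
```

Because the connection is not part of the key, a retry sent over a new connection is matched to the original request in the same way.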
If the connection that was bound to the backchannel is lost, the If the connection that was bound to the backchannel is lost, the
client may need to reconnect, and use BIND_CONN_TO_SESSION, to give client may need to reconnect, and use BIND_CONN_TO_SESSION, to give
the connection to the backchannel. If the connection that was lost the connection to the backchannel. If the connection that was lost
was the last one bound to the backchannel, the the client MUST was the last one bound to the backchannel, the client MUST reconnect,
reconnect, and bind the connection to the session and backchannel. and bind the connection to the session and backchannel. The server
The server should indicate when it has no callback connection via the should indicate when it has no callback connection via the sr_status
sr_status result from SEQUENCE. result from SEQUENCE.
2.9.8.1.3. Backchannel GSS Context Loss 2.9.8.1.3. Backchannel GSS Context Loss
Via the sr_status result of the SEQUENCE operation or other means, Via the sr_status result of the SEQUENCE operation or other means,
the client will learn if some or all of the RPCSEC_GSS contexts it the client will learn if some or all of the RPCSEC_GSS contexts it
assigned to the backchannel have been lost. The client may need to assigned to the backchannel have been lost. The client may need to
use BACKCHANNEL_CTL to assign new contexts. It MUST assign new use BACKCHANNEL_CTL to assign new contexts. It MUST assign new
contexts if there are no more contexts. contexts if there are no more contexts.
2.9.8.1.4. Loss of Session 2.9.8.1.4. Loss of Session
skipping to change at page 56, line 30 skipping to change at page 57, line 36
Note that loss of session does not imply loss of lock, open, Note that loss of session does not imply loss of lock, open,
delegation, or layout state. Nor does loss of lock, open, delegation, or layout state. Nor does loss of lock, open,
delegation, or layout state imply loss of session state. delegation, or layout state imply loss of session state.
[[Comment.12: Add reference to lock recovery section]] . A session [[Comment.12: Add reference to lock recovery section]] . A session
can survive a server reboot, but lock recovery may still be needed. can survive a server reboot, but lock recovery may still be needed.
The converse is also true. The converse is also true.
It is possible CREATE_SESSION will fail with NFS4ERR_STALE_CLIENTID It is possible CREATE_SESSION will fail with NFS4ERR_STALE_CLIENTID
(for example the server reboots and does not preserve clientid (for example the server reboots and does not preserve clientid
state). If so, the client needs to call CREATE_CLIENTID, followed by state). If so, the client needs to call EXCHANGE_ID, followed by
CREATE_SESSION. CREATE_SESSION.
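The recovery sequence for a stale clientid can be sketched as follows. The `server` object, its `exchange_id()` and `create_session()` methods, and the `RebootedServer` stand-in are hypothetical helpers for illustration; they are not a real client API, and the return shapes are assumptions.

```python
def establish_session(server, clientid):
    """Create a session, re-establishing the clientid if it has gone stale."""
    status, session = server.create_session(clientid)
    if status == "NFS4ERR_STALE_CLIENTID":
        # The server no longer knows this clientid: obtain a new one first.
        clientid = server.exchange_id()
        status, session = server.create_session(clientid)
    if status != "NFS4_OK":
        raise RuntimeError("CREATE_SESSION failed: " + status)
    return clientid, session


class RebootedServer:
    """Toy server that lost all clientid state in a reboot."""

    def __init__(self):
        self._valid = set()
        self._next = 1

    def exchange_id(self):
        clientid = self._next
        self._next += 1
        self._valid.add(clientid)
        return clientid

    def create_session(self, clientid):
        if clientid not in self._valid:
            return "NFS4ERR_STALE_CLIENTID", None
        return "NFS4_OK", "session-for-%d" % clientid
```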
2.9.8.1.5. Failover 2.9.8.1.5. Failover
[[Comment.13: Dave Noveck requested this section; not sure what is [[Comment.13: Dave Noveck requested this section; not sure what is
needed here if this refers to failover to a replica. What are the needed here if this refers to failover to a replica. What are the
session ramifications?]] session ramifications?]]
2.9.8.2. Events Requiring Server Action 2.9.8.2. Events Requiring Server Action
The following events require server action to recover. The following events require server action to recover.
2.9.8.2.1. Client Crash and Reboot 2.9.8.2.1. Client Crash and Reboot
As described in Section 16.35, a rebooted client causes the server to As described in Section 16.35, a rebooted client causes the server to
delete any sessions it had. delete any sessions it had.
2.9.8.2.2. Client Crash with No Reboot 2.9.8.2.2. Client Crash with No Reboot
If a client crashes and never comes back, it will never issue If a client crashes and never comes back, it will never issue
CREATE_CLIENTID with its old clientid. Thus the server has session EXCHANGE_ID with its old clientid. Thus the server has session state
state that will never be used again. After an extended period of that will never be used again. After an extended period of time and
time and if the server has resource constraints, it MAY destroy the if the server has resource constraints, it MAY destroy the old
old session. session.
2.9.8.2.3. Extended Network Partition 2.9.8.2.3. Extended Network Partition
To the server, the extended network partition may be no different To the server, the extended network partition may be no different
than a client crash with no reboot (see Section 2.9.8.2.2). Unless than a client crash with no reboot (see Section 2.9.8.2.2). Unless
the server can discern that there is a network partition, it is free the server can discern that there is a network partition, it is free
to treat the situation as if the client has crashed for good. to treat the situation as if the client has crashed for good.
2.9.8.2.4. Backchannel Connection Loss 2.9.8.2.4. Backchannel Connection Loss
skipping to change at page 57, line 32 skipping to change at page 58, line 43
that of the original request, the callback target will recognize the that of the original request, the callback target will recognize the
request as a retry if it did see the request prior to disconnect. request as a retry if it did see the request prior to disconnect.
If the connection lost is the last one bound to the backchannel, then If the connection lost is the last one bound to the backchannel, then
the server MUST indicate that in the sr_status field of the next the server MUST indicate that in the sr_status field of the next
SEQUENCE reply. SEQUENCE reply.
2.9.8.2.5. GSS Context Loss 2.9.8.2.5. GSS Context Loss
The server SHOULD monitor when the last RPCSEC_GSS context assigned The server SHOULD monitor when the last RPCSEC_GSS context assigned
to the backchannel is near expiry (i.e between one and two periods of to the backchannel is near expiry (i.e. between one and two periods
lease time), and indicate so in the sr_status field of the next of lease time), and indicate so in the sr_status field of the next
SEQUENCE reply. The server MUST indicate when the backchannel's last SEQUENCE reply. The server MUST indicate when the backchannel's last
RPCSEC_GSS context has expired in the sr_status field of the next RPCSEC_GSS context has expired in the sr_status field of the next
SEQUENCE reply. SEQUENCE reply.
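The expiry monitoring can be sketched as a check on the remaining lifetime of the last context. This is an interpretation, not the specified behavior: the function name, the returned labels, and the decision to treat any remaining lifetime of at most two lease periods as near expiry are assumptions, and the labels are not actual sr_status flag names.

```python
def backchannel_gss_status(seconds_until_expiry: float,
                           lease_time: float) -> str:
    """Classify the last backchannel RPCSEC_GSS context by remaining lifetime."""
    if seconds_until_expiry <= 0:
        # The server MUST report this in the next SEQUENCE reply.
        return "expired"
    if seconds_until_expiry <= 2 * lease_time:
        # The server SHOULD report this in the next SEQUENCE reply.
        return "expiring"
    return "ok"
```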
3. Protocol Data Types 3. Protocol Data Types
The syntax and semantics to describe the data types of the NFS The syntax and semantics to describe the data types of the NFS
version 4 protocol are defined in the XDR RFC4506 [3] and RPC RFC1831 version 4 protocol are defined in the XDR RFC4506 [3] and RPC RFC1831
[4] documents. The next sections build upon the XDR data types to [4] documents. The next sections build upon the XDR data types to
define types and structures specific to this protocol. define types and structures specific to this protocol.
skipping to change at page 66, line 18 skipping to change at page 67, line 18
offset4 lo_offset; offset4 lo_offset;
length4 lo_length; length4 lo_length;
layoutiomode4 lo_iomode; layoutiomode4 lo_iomode;
layouttype4 lo_type; layouttype4 lo_type;
opaque lo_layout<>; opaque lo_layout<>;
}; };
The layout4 structure defines a layout for a file. The layout type The layout4 structure defines a layout for a file. The layout type
specific data is opaque within this structure and must be specific data is opaque within this structure and must be
interpreted based on the layout type. Currently, only the NFSv4 interpreted based on the layout type. Currently, only the NFSv4
file layout type is defined; see Section 12.4.1 for its definition. file layout type is defined; see Section 12.4.2 for its definition.
Since layouts are sub-dividable, the offset and length together with Since layouts are sub-dividable, the offset and length together with
the file's filehandle, the clientid, iomode, and layout type, the file's filehandle, the clientid, iomode, and layout type,
identify the layout. identify the layout.
3.2.22. layoutupdate4 3.2.22. layoutupdate4
struct layoutupdate4 { struct layoutupdate4 {
layouttype4 lou_type; layouttype4 lou_type;
opaque lou_data<>; opaque lou_data<>;
}; };
skipping to change at page 67, line 5 skipping to change at page 68, line 5
opaque loh_data<>; opaque loh_data<>;
}; };
The layouthint4 structure is used by the client to pass in a hint The layouthint4 structure is used by the client to pass in a hint
about the type of layout it would like created for a particular file. about the type of layout it would like created for a particular file.
It is the structure specified by the FILE_LAYOUT_HINT attribute It is the structure specified by the FILE_LAYOUT_HINT attribute
described below. The metadata server may ignore the hint, or may described below. The metadata server may ignore the hint, or may
selectively ignore fields within the hint. This hint should be selectively ignore fields within the hint. This hint should be
provided at create time as part of the initial attributes within provided at create time as part of the initial attributes within
OPEN. The NFSv4 file-based layout uses the "nfsv4_file_layouthint" OPEN. The NFSv4 file-based layout uses the "nfsv4_file_layouthint"
structure as defined in Section 12.4.1. structure as defined in Section 12.4.2.
3.2.24. layoutiomode4 3.2.24. layoutiomode4
enum layoutiomode4 { enum layoutiomode4 {
LAYOUTIOMODE_READ = 1, LAYOUTIOMODE_READ = 1,
LAYOUTIOMODE_RW = 2, LAYOUTIOMODE_RW = 2,
LAYOUTIOMODE_ANY = 3 LAYOUTIOMODE_ANY = 3
}; };
The iomode specifies whether the client intends to read or write The iomode specifies whether the client intends to read or write
skipping to change at page 79, line 31 skipping to change at page 80, line 31
| | | | | getattr during | | | | | | getattr during |
| | | | | readdir. | | | | | | readdir. |
| filehandle | 19 | nfs_fh4 | READ | The filehandle of | | filehandle | 19 | nfs_fh4 | READ | The filehandle of |
| | | | | this object | | | | | | this object |
| | | | | (primarily for | | | | | | (primarily for |
| | | | | readdir requests). | | | | | | readdir requests). |
+-----------------+----+------------+--------+----------------------+ +-----------------+----+------------+--------+----------------------+
5.6. Recommended Attributes - Definitions 5.6. Recommended Attributes - Definitions
+--------------------+----+---------------+--------+----------------+ +-------------------+----+----------------+--------+----------------+
| name | # | Data Type | Access | Description | | name | # | Data Type | Access | Description |
+--------------------+----+---------------+--------+----------------+ +-------------------+----+----------------+--------+----------------+
| ACL | 12 | nfsace4<> | R/W | The access | | ACL | 12 | nfsace4<> | R/W | The access |
| | | | | control list | | | | | | control list |
| | | | | for the | | | | | | for the |
| | | | | object. | | | | | | object. |
| aclsupport | 13 | uint32 | READ | Indicates what | | aclsupport | 13 | uint32 | READ | Indicates what |
| | | | | types of ACLs | | | | | | types of ACLs |
| | | | | are supported | | | | | | are supported |
| | | | | on the current | | | | | | on the current |
| | | | | file system. | | | | | | file system. |
| archive | 14 | bool | R/W | True, if this | | archive | 14 | bool | R/W | True, if this |
skipping to change at page 81, line 4 skipping to change at page 82, line 4
| | | | | operating | | | | | | operating |
| | | | | environments | | | | | | environments |
| | | | | or in Windows | | | | | | or in Windows |
| | | | | 2000 the "Take | | | | | | 2000 the "Take |
| | | | | Ownership" | | | | | | Ownership" |
| | | | | privilege). | | | | | | privilege). |
| dir_notif_delay | 56 | nfstime4 | READ | notification | | dir_notif_delay | 56 | nfstime4 | READ | notification |
| | | | | delays on | | | | | | delays on |
| | | | | directory | | | | | | directory |
| | | | | attributes | | | | | | attributes |
| dirent_notif_delay | 57 | nfstime4 | READ | notification | | dirent_ | 57 | nfstime4 | READ | notification |
| | | | | delays on | | notif_delay | | | | delays on |
| | | | | child | | | | | | child |
| | | | | attributes | | | | | | attributes |
| fileid | 20 | uint64 | READ | A number | | fileid | 20 | uint64 | READ | A number |
| | | | | uniquely | | | | | | uniquely |
| | | | | identifying | | | | | | identifying |
| | | | | the file | | | | | | the file |
| | | | | within the | | | | | | within the |
| | | | | file system. | | | | | | file system. |
| files_avail | 21 | uint64 | READ | File slots | | files_avail | 21 | uint64 | READ | File slots |
| | | | | available to | | | | | | available to |
skipping to change at page 85, line 23 skipping to change at page 86, line 23
| | | | | NF4CHR, the | | | | | | NF4CHR, the |
| | | | | value returned | | | | | | value returned |
| | | | | SHOULD NOT be | | | | | | SHOULD NOT be |
| | | | | considered | | | | | | considered |
| | | | | useful. | | | | | | useful. |
| recv_impl_id | 59 | impl_ident4 | READ | Client obtains | | recv_impl_id | 59 | impl_ident4 | READ | Client obtains |
| | | | | the server's | | | | | | the server's |
| | | | | implementation | | | | | | implementation |
| | | | | identity via | | | | | | identity via |
| | | | | GETATTR. | | | | | | GETATTR. |
| retentevt_get | 71 | retention_get4 | READ | Get the |
| | | | | event-based |
| | | | | retention |
| | | | | duration, and |
| | | | | if enabled, |
| | | | | the |
| | | | | event-based |
| | | | | retention |
| | | | | begin time of |
| | | | | the file |
| | | | | object. |
| | | | | GETATTR use |
| | | | | only. |
| retentevt_set | 72 | retention_set4 | WRITE | Set the |
| | | | | event-based |
| | | | | retention |
| | | | | duration, and |
| | | | | optionally |
| | | | | enable |
| | | | | event-based |
| | | | | retention on |
| | | | | the file |
| | | | | object. |
| | | | | SETATTR use |
| | | | | only. |
| retention_get | 69 | retention_get4 | READ | Get the |
| | | | | retention |
| | | | | duration, and |
| | | | | if enabled, |
| | | | | the retention |
| | | | | begin time of |
| | | | | the file |
| | | | | object. |
| | | | | GETATTR use |
| | | | | only. |
| retention_hold | 73 | uint64_t | R/W | Get or set |
| | | | | administrative |
| | | | | retention |
| | | | | holds, one |
| | | | | hold per bit |
| | | | | position. |
| retention_set | 70 | retention_set4 | WRITE | Set the |
| | | | | retention |
| | | | | duration, and |
| | | | | optionally |
| | | | | enable |
| | | | | retention on |
| | | | | the file |
| | | | | object. |
| | | | | SETATTR use |
| | | | | only. |
| send_impl_id | 58 | impl_ident4 | WRITE | Client | | send_impl_id | 58 | impl_ident4 | WRITE | Client |
| | | | | provides | | | | | | provides |
| | | | | server with | | | | | | server with |
| | | | | its | | | | | | its |
| | | | | implementation | | | | | | implementation |
| | | | | identity via | | | | | | identity via |
| | | | | SETATTR. | | | | | | SETATTR. |
| space_avail | 42 | uint64 | READ | Disk space in | | space_avail | 42 | uint64 | READ | Disk space in |
| | | | | bytes | | | | | | bytes |
| | | | | available to | | | | | | available to |
skipping to change at page 87, line 18 skipping to change at page 89, line 34
| time_modify | 53 | nfstime4 | READ | The time of | | time_modify | 53 | nfstime4 | READ | The time of |
| | | | | last | | | | | | last |
| | | | | modification | | | | | | modification |
| | | | | to the object. | | | | | | to the object. |
| time_modify_set | 54 | settime4 | WRITE | Set the time | | time_modify_set | 54 | settime4 | WRITE | Set the time |
| | | | | of last | | | | | | of last |
| | | | | modification | | | | | | modification |
| | | | | to the object. | | | | | | to the object. |
| | | | | SETATTR use | | | | | | SETATTR use |
| | | | | only. | | | | | | only. |
+--------------------+----+---------------+--------+----------------+ +-------------------+----+----------------+--------+----------------+
5.7. Time Access 5.7. Time Access
As defined above, the time_access attribute represents the time of As defined above, the time_access attribute represents the time of
last access to the object by a read that was satisfied by the server. last access to the object by a read that was satisfied by the server.
The notion of what is an "access" depends on the server's operating The notion of what is an "access" depends on the server's operating
environment and/or the server's file system semantics. For example, environment and/or the server's file system semantics. For example,
for servers obeying POSIX semantics, time_access would be updated for servers obeying POSIX semantics, time_access would be updated
only by the READLINK, READ, and READDIR operations and not any of the only by the READLINK, READ, and READDIR operations and not any of the
operations that modify the content of the object. Of course, setting operations that modify the content of the object. Of course, setting
skipping to change at page 93, line 18 skipping to change at page 95, line 39
The attribute is available on a per filehandle basis. If the current The attribute is available on a per filehandle basis. If the current
filehandle refers to a non-pNFS file or directory, the metadata filehandle refers to a non-pNFS file or directory, the metadata
server should return an attribute that is representative of the server should return an attribute that is representative of the
filehandle's file system. It is suggested that this attribute is filehandle's file system. It is suggested that this attribute is
queried as part of the OPEN operation. Due to dynamic system queried as part of the OPEN operation. Due to dynamic system
changes, the client should not assume that the attribute will remain changes, the client should not assume that the attribute will remain
constant for any specific time period, thus it should be periodically constant for any specific time period, thus it should be periodically
refreshed. refreshed.
5.17. Retention Attributes
Retention is a concept whereby a file object can be placed in an
immutable, undeletable, unrenamable state for a fixed or infinite
duration of time. Once in this "retained" state, the file cannot be
moved out of the state until the duration of retention has been
reached.
When retention is enabled, retention MUST extend to the data of the
file, and the name of the file. The server MAY extend retention to any
other property of the file, including any subset of mandatory,
recommended, and named attributes, with the exceptions noted in this
section.
Servers MAY support or not support retention on any file object type.
There are five retention attributes:
o retention_get. This attribute is only readable via GETATTR and
not settable via SETATTR. The value of the attribute consists of:
const RET4_DURATION_INFINITE = 0xffffffffffffffff;
struct retention_get4 {
uint64_t rg_duration;
nfstime4 rg_begin_time<1>;
};
The field rg_duration is a duration in seconds indicating how long
the file will be retained once retention is enabled. The field
rg_begin_time is an array of up to one absolute time value. If
the array is zero length, no beginning retention time has been
established, and retention is not enabled. If rg_duration is
equal to RET4_DURATION_INFINITE, the file, once retention is
enabled, will be retained for an infinite duration.
o retention_set. This attribute corresponds to retention_get. This
attribute is only settable via SETATTR and not readable via
GETATTR. The value of the attribute consists of:
struct retention_set4 {
bool rs_enable;
uint64_t rs_duration<1>;
};
If the client sets rs_enable to TRUE, then it is enabling
retention on the file object with the begin time of retention
commencing from the server's current time and date. The duration
of the retention can also be provided if the rs_duration array is
of length one. The duration is the time in seconds from the begin
time of retention, and if set to RET4_DURATION_INFINITE, the file
is to be retained forever. If retention is enabled, with no
duration specified in either this SETATTR or a previous SETATTR,
the duration defaults to zero seconds. The server MAY restrict
the enabling of retention or the duration of retention on the
basis of the ACE4_WRITE_RETENTION ACL permission. The enabling of
retention does not prevent the enabling of event-based retention
nor the modification of the retention_hold attribute.
o retentevt_get. This attribute is like retention_get, but refers
to event-based retention. The event that triggers event-based
retention is not defined by the NFSv4.1 specification.
o retentevt_set. This attribute corresponds to retentevt_get and is
like retention_set, but refers to event-based retention. When
event-based retention is set, the file MUST be retained even if
non-event-based retention has been set and the duration of
non-event-based retention has been reached. Conversely, when
non-event-based retention has been set, the file MUST be retained
even if event-based retention has been set and the duration of
event-based retention has been reached. The server MAY restrict the
enabling of event-based retention or the duration of event-based
retention on the basis of the ACE4_WRITE_RETENTION ACL permission.
The enabling of event-based retention does not prevent the
enabling of non-event-based retention nor the modification of the
retention_hold attribute.
o retention_hold. This attribute allows from one to 64 administrative
holds, one hold per bit of the attribute. If retention_hold is
not zero, then the file MUST NOT be deleted, renamed, or modified,
even if the duration of enabled event-based or non-event-based
retention has been reached. The server MAY restrict the modification
of retention_hold on the basis of the ACE4_WRITE_RETENTION_HOLD ACL
permission. The enabling of administrative retention holds does
not prevent the enabling of event-based or non-event-based
retention.
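The interaction of the three retention mechanisms described in this section can be illustrated with a small sketch. It is not part of the protocol. The names Retention and is_retained are hypothetical, and times are modeled as plain integer seconds rather than nfstime4 values. Only RET4_DURATION_INFINITE and the meaning of the duration and begin-time fields are taken from the definitions above.

```python
from dataclasses import dataclass
from typing import Optional

RET4_DURATION_INFINITE = 0xFFFFFFFFFFFFFFFF  # constant from the XDR above


@dataclass
class Retention:
    """Hypothetical model of retention_get4: duration plus optional begin time."""
    duration: int = 0                 # rg_duration, in seconds
    begin_time: Optional[int] = None  # rg_begin_time<1>; None models a zero-length array

    def active(self, now: int) -> bool:
        # A zero-length begin-time array means retention is not enabled.
        if self.begin_time is None:
            return False
        if self.duration == RET4_DURATION_INFINITE:
            return True
        # Retention ends once the duration has been reached.
        return now < self.begin_time + self.duration


def is_retained(non_event: Retention, event: Retention, hold: int, now: int) -> bool:
    """The file stays retained while any one of the three mechanisms applies."""
    return hold != 0 or non_event.active(now) or event.active(now)
```

In this model a non-zero retention_hold keeps the file retained regardless of either duration, and each of the two duration-based mechanisms is evaluated independently of the other.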
6. Access Control Lists 6. Access Control Lists
Access Control Lists (ACLs) are a file attribute that specify fine Access Control Lists (ACLs) are a file attribute that specify fine
grained access control. This chapter covers the "acl", "aclsupport", grained access control. This chapter covers the "acl", "aclsupport",
and "mode" file attributes, and their interactions. and "mode" file attributes, and their interactions.
6.1. Goals 6.1. Goals
ACLs and modes represent two well established but different models ACLs and modes represent two well established but different models
for specifying permissions. This chapter specifies requirements that for specifying permissions. This chapter specifies requirements that
skipping to change at page 97, line 39 skipping to change at page 102, line 17
const ACE4_WRITE_DATA = 0x00000002; const ACE4_WRITE_DATA = 0x00000002;
const ACE4_ADD_FILE = 0x00000002; const ACE4_ADD_FILE = 0x00000002;
const ACE4_APPEND_DATA = 0x00000004; const ACE4_APPEND_DATA = 0x00000004;
const ACE4_ADD_SUBDIRECTORY = 0x00000004; const ACE4_ADD_SUBDIRECTORY = 0x00000004;
const ACE4_READ_NAMED_ATTRS = 0x00000008; const ACE4_READ_NAMED_ATTRS = 0x00000008;
const ACE4_WRITE_NAMED_ATTRS = 0x00000010; const ACE4_WRITE_NAMED_ATTRS = 0x00000010;
const ACE4_EXECUTE = 0x00000020; const ACE4_EXECUTE = 0x00000020;
const ACE4_DELETE_CHILD = 0x00000040; const ACE4_DELETE_CHILD = 0x00000040;
const ACE4_READ_ATTRIBUTES = 0x00000080; const ACE4_READ_ATTRIBUTES = 0x00000080;
const ACE4_WRITE_ATTRIBUTES = 0x00000100; const ACE4_WRITE_ATTRIBUTES = 0x00000100;
const ACE4_WRITE_RETENTION = 0x00000200;
const ACE4_WRITE_RETENTION_HOLD = 0x00000400;
const ACE4_DELETE = 0x00010000; const ACE4_DELETE = 0x00010000;
const ACE4_READ_ACL = 0x00020000; const ACE4_READ_ACL = 0x00020000;
const ACE4_WRITE_ACL = 0x00040000; const ACE4_WRITE_ACL = 0x00040000;
const ACE4_WRITE_OWNER = 0x00080000; const ACE4_WRITE_OWNER = 0x00080000;
const ACE4_SYNCHRONIZE = 0x00100000; const ACE4_SYNCHRONIZE = 0x00100000;
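As an illustration of how the two retention mask bits might be tested, the following sketch checks an already-computed allow mask against the bit needed to SETATTR each retention attribute. The function name may_setattr, the REQUIRED_BIT table, and the map_write_attributes flag are assumptions made for this example. The flag models one possible reading of the statement that a server MAY map ACE4_WRITE_ATTRIBUTES to the retention bits.

```python
# Mask values copied from the constants above.
ACE4_WRITE_ATTRIBUTES = 0x00000100
ACE4_WRITE_RETENTION = 0x00000200
ACE4_WRITE_RETENTION_HOLD = 0x00000400

# Hypothetical table: retention attribute -> mask bit needed to SETATTR it.
REQUIRED_BIT = {
    "retention_set": ACE4_WRITE_RETENTION,
    "retentevt_set": ACE4_WRITE_RETENTION,
    "retention_hold": ACE4_WRITE_RETENTION_HOLD,
}


def may_setattr(allowed_mask: int, attr: str, map_write_attributes: bool = False) -> bool:
    """Return True if allowed_mask permits SETATTR of the given retention attribute."""
    if allowed_mask & REQUIRED_BIT[attr]:
        return True
    # Optional server policy (an assumption): treat ACE4_WRITE_ATTRIBUTES
    # as also granting the retention permissions.
    return map_write_attributes and bool(allowed_mask & ACE4_WRITE_ATTRIBUTES)
```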
6.2.1.3.1. Discussion of Mask Attributes 6.2.1.3.1. Discussion of Mask Attributes
ACE4_READ_DATA ACE4_READ_DATA
Operation(s) affected: Operation(s) affected:
skipping to change at page 100, line 45 skipping to change at page 105, line 26
SETATTR of time_access_set, time_backup, SETATTR of time_access_set, time_backup,
time_create, time_modify_set, mimetype, hidden, system time_create, time_modify_set, mimetype, hidden, system
Discussion: Discussion:
Permission to change the times associated with a file Permission to change the times associated with a file
or directory to an arbitrary value. Also permission or directory to an arbitrary value. Also permission
to change the mimetype, hidden and system attributes. to change the mimetype, hidden and system attributes.
A user having ACE4_WRITE_DATA permission, but lacking A user having ACE4_WRITE_DATA permission, but lacking
ACE4_WRITE_ATTRIBUTES must be allowed to implicitly set ACE4_WRITE_ATTRIBUTES must be allowed to implicitly set
the times associated with a file. the times associated with a file.
ACE4_WRITE_RETENTION
Operation(s) affected:
SETATTR of retention_set, retentevt_set.
Discussion:
Permission to modify the durations of event-based and non-event-based
retention. Also permission to enable event-based and non-event-based
retention. A server MAY map ACE4_WRITE_ATTRIBUTES to
ACE4_WRITE_RETENTION.
ACE4_WRITE_RETENTION_HOLD
Operation(s) affected:
SETATTR of retention_hold.
Discussion:
Permission to modify the administrative retention holds.
A server MAY map ACE4_WRITE_ATTRIBUTES to
ACE4_WRITE_RETENTION_HOLD.
ACE4_DELETE ACE4_DELETE
Operation(s) affected: Operation(s) affected:
REMOVE REMOVE
Discussion: Discussion:
Permission to delete the file or directory. See section Permission to delete the file or directory. See section
"ACE4_DELETE vs. ACE4_DELETE_CHILD" for information on how "ACE4_DELETE vs. ACE4_DELETE_CHILD" for information on how
these two access mask bits interact. these two access mask bits interact.
ACE4_READ_ACL ACE4_READ_ACL
Operation(s) affected: Operation(s) affected:
skipping to change at page 118, line 14 skipping to change at page 123, line 5
which they may be referenced. The stateid is used as a shorthand which they may be referenced. The stateid is used as a shorthand
reference to a lock or set of locks and given a stateid the client reference to a lock or set of locks and given a stateid the client
can determine the associated state-owner or state-owners (in the case can determine the associated state-owner or state-owners (in the case
of an open-owner/lock-owner pair) and the associated. Clients, of an open-owner/lock-owner pair) and the associated. Clients,
however, must not assume any such mapping and must not use a stateid however, must not assume any such mapping and must not use a stateid
returned for a given filehandle and state-owner in the context of a returned for a given filehandle and state-owner in the context of a
different filehandle or a different state-owner. different filehandle or a different state-owner.
The server is free to form the stateid in any manner that it chooses The server is free to form the stateid in any manner that it chooses
as long as it is able to recognize invalid and out-of-date stateids. as long as it is able to recognize invalid and out-of-date stateids.
Although the protocol XDR definition divides the stateid into into Although the protocol XDR definition divides the stateid into 'seqid'
'seqid' and 'other' fields, for the purposes of minor version one, and 'other' fields, for the purposes of minor version one, this
this distinction is not important and the server may use the distinction is not important and the server may use the available
available space as it chooses, with one exception. space as it chooses, with one exception.
The exception is that stateids whose 'other' field is either all The exception is that stateids whose 'other' field is either all
zeros or all ones are reserved and may not be generated by the zeros or all ones are reserved and may not be generated by the
server. Clients may use the protocol-defined special stateid values server. Clients may use the protocol-defined special stateid values
for their defined purposes, but any use of stateids in this reserved for their defined purposes, but any use of stateids in this reserved
class that are not specially defined by the protocol MUST result in class that are not specially defined by the protocol MUST result in
an NFS4ERR_BAD_STATEID being returned. an NFS4ERR_BAD_STATEID being returned.
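The reserved-value rule can be sketched as a simple classification of a client-supplied stateid. This is an illustration under stated assumptions. The 12-byte size of the 'other' field is not given in this excerpt and is assumed here. The set SPECIAL_STATEIDS contains only the all-zeros and all-ones values mentioned in this chapter, and the protocol may define others. The function names are hypothetical.

```python
NFS4_OTHER_SIZE = 12  # assumed size of the 'other' field; not stated in this excerpt

# Special stateids assumed for this sketch: all bits zero and all bits one.
ALL_ZEROS = (0x00000000, b"\x00" * NFS4_OTHER_SIZE)
ALL_ONES = (0xFFFFFFFF, b"\xff" * NFS4_OTHER_SIZE)
SPECIAL_STATEIDS = {ALL_ZEROS, ALL_ONES}


def is_reserved_other(other: bytes) -> bool:
    """True if 'other' is all zeros or all ones; a server never generates these."""
    return len(other) > 0 and (
        all(b == 0x00 for b in other) or all(b == 0xFF for b in other)
    )


def classify(seqid: int, other: bytes) -> str:
    """Classify a stateid received from a client."""
    if not is_reserved_other(other):
        return "ordinary"
    if (seqid, other) in SPECIAL_STATEIDS:
        return "special"
    # Reserved class, but not a value the protocol defines.
    return "NFS4ERR_BAD_STATEID"
```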
Clients may not compare stateids associated with different Clients may not compare stateids associated with different
filehandles, so that a server might use stateids with the same bit filehandles, so that a server might use stateids with the same bit
skipping to change at page 121, line 49 skipping to change at page 126, line 42
are allowed in these circumstances, the server MUST still check for are allowed in these circumstances, the server MUST still check for
locks that conflict with the READ (e.g. another open specifying denial locks that conflict with the READ (e.g. another open specifying denial
of READs). Note that a server which does enforce the access mode of READs). Note that a server which does enforce the access mode
check on READs need not explicitly check for conflicting share check on READs need not explicitly check for conflicting share
reservations since the existence of OPEN for read access guarantees reservations since the existence of OPEN for read access guarantees
that no conflicting share reservation can exist. that no conflicting share reservation can exist.
A special stateid of all bits 1 (one), including all fields in the A special stateid of all bits 1 (one), including all fields in the
stateid indicates a desire to bypass locking checks. The server MAY stateid indicates a desire to bypass locking checks. The server MAY
allow READ operations to bypass locking checks at the server, when allow READ operations to bypass locking checks at the server, when
this special stateid is used. However, WRITE operations with with this special stateid is used. However, WRITE operations with this
this special stateid value MUST NOT bypass locking checks and are special stateid value MUST NOT bypass locking checks and are treated
treated exactly the same as if a stateid of all bits 0 were used. exactly the same as if a stateid of all bits 0 were used.
A lock may not be granted while a READ or WRITE operation using one A lock may not be granted while a READ or WRITE operation using one
of the special stateids is being performed and the range of the lock of the special stateids is being performed and the range of the lock
request conflicts with the range of the READ or WRITE operation. For request conflicts with the range of the READ or WRITE operation. For
the purposes of this paragraph, a conflict occurs when a shared lock the purposes of this paragraph, a conflict occurs when a shared lock
is requested and a WRITE operation is being performed, or an is requested and a WRITE operation is being performed, or an
exclusive lock is requested and either a READ or a WRITE operation is exclusive lock is requested and either a READ or a WRITE operation is
being performed. A SETATTR that sets size is treated similarly to a being performed. A SETATTR that sets size is treated similarly to a
WRITE as discussed above. WRITE as discussed above.
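The conflict rule in the preceding paragraph reduces to a small predicate. The sketch below is illustrative only. The function names are hypothetical, and ranges are modeled as an offset plus a finite, positive length, which ignores any special "to end of file" length encoding.

```python
def ranges_overlap(off_a: int, len_a: int, off_b: int, len_b: int) -> bool:
    """True if the half-open ranges [off, off + len) intersect."""
    return off_a < off_b + len_b and off_b < off_a + len_a


def lock_must_wait(lock_exclusive: bool, lock_off: int, lock_len: int,
                   io_is_write: bool, io_off: int, io_len: int) -> bool:
    """True if a lock request conflicts with in-progress special-stateid I/O.

    A shared lock conflicts only with a WRITE. An exclusive lock
    conflicts with either a READ or a WRITE.
    """
    if not ranges_overlap(lock_off, lock_len, io_off, io_len):
        return False
    return lock_exclusive or io_is_write
```

Under the same rule, a SETATTR that sets size would be passed to this predicate as a WRITE.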
skipping to change at page 124, line 44 skipping to change at page 129, line 35
restarts or reboots. All READ and WRITE operations that may have restarts or reboots. All READ and WRITE operations that may have
been queued within the client or network buffers must wait until the been queued within the client or network buffers must wait until the
client has successfully recovered the locks protecting the READ and client has successfully recovered the locks protecting the READ and
WRITE operations. WRITE operations.
8.6.1. Client Failure and Recovery 8.6.1. Client Failure and Recovery
In the event that a client fails, the server may release the client's In the event that a client fails, the server may release the client's
locks when the associated leases have expired. Conflicting locks locks when the associated leases have expired. Conflicting locks
from another client may only be granted after this lease expiration. from another client may only be granted after this lease expiration.
When a client has not not failed and re-establishes his lease before When a client has not failed and re-establishes his lease before
expiration occurs, requests for conflicting locks will not be expiration occurs, requests for conflicting locks will not be
granted. granted.
To minimize client delay upon restart, lock requests are associated To minimize client delay upon restart, lock requests are associated
with an instance of the client by a client supplied verifier. This with an instance of the client by a client supplied verifier. This
verifier is part of the initial CREATE_CLIENTID call made by the verifier is part of the initial EXCHANGE_ID call made by the client.
client. The server returns a clientid as a result of the The server returns a clientid as a result of the EXCHANGE_ID
CREATE_CLIENTID operation. The client then confirms the use of the operation. The client then confirms the use of the clientid by
clientid by establishing a session associated with that clientid. establishing a session associated with that clientid. All locks,
All locks, including opens, byte-range locks, delegations, and layout including opens, byte-range locks, delegations, and layout obtained
obtained by sessions using that clientid are associated with that by sessions using that clientid are associated with that clientid.
clientid.
Since the verifier will be changed by the client upon each Since the verifier will be changed by the client upon each
initialization, the server can compare a new verifier to the verifier initialization, the server can compare a new verifier to the verifier
associated with currently held locks and determine that they do not associated with currently held locks and determine that they do not
match. This signifies the client's new instantiation and subsequent match. This signifies the client's new instantiation and subsequent
loss of locking state. As a result, the server is free to release loss of locking state. As a result, the server is free to release
all locks held which are associated with the old clientid which was all locks held which are associated with the old clientid which was
derived from the old verifier. At this point conflicting locks from derived from the old verifier. At this point conflicting locks from
other clients, kept waiting while the lease had not yet expired, can other clients, kept waiting while the lease had not yet expired, can
be granted. be granted.
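The verifier comparison described above can be sketched as follows. This is a deliberately simplified model, not the full EXCHANGE_ID processing: it ignores confirmation via session creation, principals, and lease timing. The class ClientTable, its fields, and the string form of locks are hypothetical.

```python
import itertools


class ClientTable:
    """Toy server-side table mapping a client owner to its verifier and clientid."""

    def __init__(self):
        self._next_id = itertools.count(1)
        self.by_owner = {}  # owner string -> (verifier, clientid)
        self.locks = {}     # clientid -> set of lock descriptions

    def exchange_id(self, owner: str, verifier: bytes) -> int:
        known = self.by_owner.get(owner)
        if known is not None:
            old_verifier, old_clientid = known
            if old_verifier == verifier:
                return old_clientid  # same client instance, state is kept
            # A new verifier signals a new client instantiation, so the
            # locks tied to the old clientid may be released.
            self.locks.pop(old_clientid, None)
        clientid = next(self._next_id)
        self.by_owner[owner] = (verifier, clientid)
        self.locks[clientid] = set()
        return clientid
```

Once the old locks are dropped in this model, conflicting requests from other clients no longer need to wait for the old lease to expire.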
skipping to change at page 136, line 5 skipping to change at page 140, line 48
There are a number of operations and fields within existing There are a number of operations and fields within existing
operations that no longer have a function in minor version one. In operations that no longer have a function in minor version one. In
one way or another, these changes are all due to the implementation one way or another, these changes are all due to the implementation
of sessions which provides client context and replay protection as a of sessions which provides client context and replay protection as a
base feature of the protocol, separate from locking itself. base feature of the protocol, separate from locking itself.
The following operations have become mandatory-to-not-implement. The The following operations have become mandatory-to-not-implement. The
server should return NFS4ERR_NOTSUPP if these operations are found in server should return NFS4ERR_NOTSUPP if these operations are found in
an NFSv4.1 COMPOUND. an NFSv4.1 COMPOUND.
o SETCLIENTID since its function has been replaced by o SETCLIENTID since its function has been replaced by EXCHANGE_ID.
CREATE_CLIENTID.
o SETCLIENTID_CONFIRM since clientid confirmation now happens by o SETCLIENTID_CONFIRM since clientid confirmation now happens by
means of CREATE_SESSION. means of CREATE_SESSION.
o OPEN_CONFIRM because OPEN's no longer require confirmation to o OPEN_CONFIRM because OPEN's no longer require confirmation to
establish an owner-based sequence value. establish an owner-based sequence value.
o RELEASE_LOCKOWNER because lock-owners with no associated locks o RELEASE_LOCKOWNER because lock-owners with no associated locks
do not have any sequence-related state and so can be deleted by the do not have any sequence-related state and so can be deleted by the
server at will. server at will.
skipping to change at page 163, line 28 skipping to change at page 168, line 28
10.1. Location attributes 10.1. Location attributes
NFSv4 contains recommended attributes that allow file systems on one NFSv4 contains recommended attributes that allow file systems on one
server to be associated with one or more instances of that file server to be associated with one or more instances of that file
system on other servers. These attributes specify such file systems system on other servers. These attributes specify such file systems
by specifying a server name (either a DNS name or an IP address) by specifying a server name (either a DNS name or an IP address)
together with the path of that file system within that server's together with the path of that file system within that server's
single-server name space. single-server name space.
The fs_locations_info recommended attribute allows specification of The fs_locations_info recommended attribute allows specification of
one or more file systems locations where the data corresponding to a one or more file systems instance locations where the data corresponding
given file system may be found. This attributes provides to the to a given file system may be found. This attribute provides to the
client, in addition to information about file system locations, client, in addition to information about file system instance
extensive information about the various file system choices (e.g. locations, extensive information about the various file system
priority for use, writability, currency, etc.) as well as information instance choices (e.g. priority for use, writability, currency, etc.)
to help the client efficiently effect as seamless a transition as as well as information to help the client efficiently effect as
possible among multiple file system instances, when and if that seamless a transition as possible among multiple file system
should be necessary. instances, when and if that should be necessary.
The fs_locations recommended attribute is inherited from NFSv4.0 and The fs_locations recommended attribute is inherited from NFSv4.0 and
only allows specification of the file system locations where the data only allows specification of the file system locations where the data
corresponding to a given file system may be found. Servers should corresponding to a given file system may be found. Servers should
make this attribute available whenever fs_locations_info is make this attribute available whenever fs_locations_info is
supported, but client use of fs_locations_info is to be preferred. supported, but client use of fs_locations_info is to be preferred.
10.2. File System Presence or Absence 10.2. File System Presence or Absence
A given location in an NFSv4 namespace (typically but not necessarily A given location in an NFSv4 namespace (typically but not necessarily
a multi-server namespace) can have a number of file system locations a multi-server namespace) can have a number of file system instance
associated with it (via the fs_locations or fs_locations_info locations associated with it (via the fs_locations or
attribute). There may also be an actual current file system at that fs_locations_info attribute). There may also be an actual current
location, accessible via normal namespace operations (e.g. LOOKUP). file system at that location, accessible via normal namespace
In this case there, the file system is said to be "present" at that operations (e.g. LOOKUP). In this case, the file system is said to
position in the namespace and clients will typically use it, be "present" at that position in the namespace and clients will
reserving use of additional locations specified via the location- typically use it, reserving use of additional locations specified via
related attributes to situations in which the principal location is the location-related attributes to situations in which the principal
no longer available. location is no longer available.
When there is no actual file system at the namespace location in When there is no actual file system at the namespace location in
question, the file system is said to be "absent". An absent file question, the file system is said to be "absent". An absent file
system contains no files or directories other than the root and any system contains no files or directories other than the root and any
reference to it, except to access a small set of attributes useful in reference to it, except to access a small set of attributes useful in
determining alternate locations, will result in an error, determining alternate locations, will result in an error,
NFS4ERR_MOVED. Note that if the server ever returns NFS4ERR_MOVED NFS4ERR_MOVED. Note that if the server ever returns NFS4ERR_MOVED
(i.e. file systems may be absent), it MUST support the fs_locations (i.e. file systems may be absent), it MUST support the fs_locations
attribute and SHOULD support the fs_locations_info and fs_absent attribute and SHOULD support the fs_locations_info and fs_absent
attributes. attributes.
skipping to change at page 164, line 34 skipping to change at page 169, line 34
more limited conception of its function, but this error will be more limited conception of its function, but this error will be
returned whenever the referenced file system is absent, whether it returned whenever the referenced file system is absent, whether it
has moved or not. has moved or not.
Except in the case of GETATTR-type operations (to be discussed Except in the case of GETATTR-type operations (to be discussed
later), when the current filehandle at the start of an operation is later), when the current filehandle at the start of an operation is
within an absent file system, that operation is not performed and the within an absent file system, that operation is not performed and the
error NFS4ERR_MOVED returned, to indicate that the file system is error NFS4ERR_MOVED returned, to indicate that the file system is
absent on the current server. absent on the current server.
Because a GETFH cannot succeed, if the current filehandle is within Because a GETFH cannot succeed if the current filehandle is within an
an absent file system, filehandles within an absent file system absent file system, filehandles within an absent file system cannot
cannot be transferred to the client. When a client does have be transferred to the client. When a client does have filehandles
filehandles within an absent file system, it is the result of within an absent file system, it is the result of obtaining them when
obtaining them when the file system was present, and having the file the file system was present, and having the file system become absent
system become absent subsequently. subsequently.
It should be noted that because the check for the current filehandle It should be noted that because the check for the current filehandle
being within an absent file system happens at the start of every being within an absent file system happens at the start of every
operation, operations which change the current filehandle so that it operation, operations which change the current filehandle so that it
is within an absent file system will not result in an error. This is within an absent file system will not result in an error. This
allows such combinations as PUTFH-GETATTR and LOOKUP-GETATTR to be allows such combinations as PUTFH-GETATTR and LOOKUP-GETATTR to be
used to get attribute information, particularly location attribute used to get attribute information, particularly location attribute
information, as discussed below. information, as discussed below.
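The start-of-operation check described above can be sketched as follows. This is an illustrative model only, not server code: the function run_compound, the string filehandles, and the treatment of a LOOKUP argument as the resulting filehandle are all assumptions made for the example, and the further attribute-mask conditions on GETATTR are deliberately left out.

```python
NFS4_OK = "NFS4_OK"
NFS4ERR_MOVED = "NFS4ERR_MOVED"


def run_compound(ops, absent_fhs):
    """Evaluate a simplified COMPOUND.

    ops        -- list of (operation_name, argument) pairs
    absent_fhs -- set of filehandles lying within absent file systems
    (Both shapes are hypothetical simplifications for this sketch.)
    """
    current_fh = None
    results = []
    for name, arg in ops:
        # The check uses the filehandle that is current at the START of
        # the operation; GETATTR-type operations are exempt from it.
        if current_fh in absent_fhs and name != "GETATTR":
            results.append((name, NFS4ERR_MOVED))
            break  # processing stops at the first failing operation
        if name in ("PUTFH", "LOOKUP"):
            # The new current filehandle may be within an absent file
            # system; that alone does not cause an error.
            current_fh = arg
        results.append((name, NFS4_OK))
    return results
```

Under these assumptions, PUTFH followed by GETATTR succeeds on a filehandle within an absent file system, while PUTFH followed by GETFH fails on the second operation.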
The recommended file system attribute fs_absent can be used to The recommended file system attribute fs_absent can be used to
skipping to change at page 166, line 10 skipping to change at page 171, line 10
gratuitously providing additional information. gratuitously providing additional information.
When a GETATTR operation includes a bit mask for one of the When a GETATTR operation includes a bit mask for one of the
attributes fs_locations, fs_locations_info, or absent, but where the attributes fs_locations, fs_locations_info, or absent, but where the
bit mask includes attributes which are not supported, GETATTR will bit mask includes attributes which are not supported, GETATTR will
not return an error, but will return the mask of the actual not return an error, but will return the mask of the actual
attributes supported with the results. attributes supported with the results.
Handling of VERIFY/NVERIFY is similar to GETATTR in that if the Handling of VERIFY/NVERIFY is similar to GETATTR in that if the
attribute mask does not include fs_locations, fs_locations_info, or attribute mask does not include fs_locations, fs_locations_info, or
absent, the error NFS4ERR_MOVED will result. It differs in that any fs_absent, the error NFS4ERR_MOVED will result. It differs in that
appearance in the attribute mask of an attribute not supported for an any appearance in the attribute mask of an attribute not supported
absent file system (and note that this will include some normally for an absent file system (and note that this will include some
mandatory attributes), will also cause an NFS4ERR_MOVED result. normally mandatory attributes), will also cause an NFS4ERR_MOVED
result.
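The VERIFY/NVERIFY rule above might be modelled as below. The set ABSENT_FS_SUPPORTED is purely an assumption for illustration; which attributes a server actually supports for an absent file system is not given in this passage. The function name and the "PROCEED" result are likewise hypothetical.

```python
LOCATION_ATTRS = {"fs_locations", "fs_locations_info", "fs_absent"}

# Assumption: an illustrative set of attributes supported for an absent
# file system. The real set is defined elsewhere in the specification.
ABSENT_FS_SUPPORTED = LOCATION_ATTRS | {"fsid", "mounted_on_fileid"}


def verify_on_absent_fs(mask):
    """Decide how VERIFY/NVERIFY on an absent file system is handled."""
    requested = set(mask)
    # None of the location-related attributes is requested.
    if not requested & LOCATION_ATTRS:
        return "NFS4ERR_MOVED"
    # Any attribute not supported for an absent file system also fails,
    # even if it is normally mandatory.
    if requested - ABSENT_FS_SUPPORTED:
        return "NFS4ERR_MOVED"
    return "PROCEED"
```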
10.3.2. READDIR and Absent File Systems 10.3.2. READDIR and Absent File Systems
A READDIR performed when the current filehandle is within an absent A READDIR performed when the current filehandle is within an absent
file system will result in an NFS4ERR_MOVED error, since, unlike the file system will result in an NFS4ERR_MOVED error, since, unlike the
case of GETATTR, no such exception is made for READDIR. case of GETATTR, no such exception is made for READDIR.
Attributes for an absent file system may be fetched via a READDIR for Attributes for an absent file system may be fetched via a READDIR for
a directory in a present file system, when that directory contains a directory in a present file system, when that directory contains
the root directories of one or more absent file systems. In this the root directories of one or more absent file systems. In this
case, the handling is as follows: case, the handling is as follows:
o If the attribute set requested includes one of the attributes o If the attribute set requested includes one of the attributes
fs_locations, fs_locations_info, or absent, then fetching of fs_locations, fs_locations_info, or fs_absent, then fetching of
attributes proceeds normally and no NFS4ERR_MOVED indication is attributes proceeds normally and no NFS4ERR_MOVED indication is
returned, even when the rdattr_error attribute is requested. returned, even when the rdattr_error attribute is requested.
o If the attribute set requested does not include one of the o If the attribute set requested does not include one of the
attributes fs_locations, fs_locations_info, or fs_absent, then if attributes fs_locations, fs_locations_info, or fs_absent, then if
the rdattr_error attribute is requested, each directory entry for the rdattr_error attribute is requested, each directory entry for
the root of an absent file system, will report NFS4ERR_MOVED as the root of an absent file system, will report NFS4ERR_MOVED as
the value of the rdattr_error attribute. the value of the rdattr_error attribute.
o If the attribute set requested does not include any of the o If the attribute set requested does not include any of the
skipping to change at page 167, line 16 skipping to change at page 172, line 16
The location-bearing attributes (fs_locations and fs_locations_info), The location-bearing attributes (fs_locations and fs_locations_info),
provide, together with the possibility of absent file systems, a provide, together with the possibility of absent file systems, a
number of important facilities in providing reliable, manageable, and number of important facilities in providing reliable, manageable, and
scalable data access. scalable data access.
When a file system is present, these attributes can provide When a file system is present, these attributes can provide
alternative locations, to be used to access the same data, in the alternative locations, to be used to access the same data, in the
event that server failures, communications problems, or other event that server failures, communications problems, or other
difficulties, make continued access to the current file system difficulties, make continued access to the current file system
impossible or otherwise impractical. Provision of such alternate impossible or otherwise impractical. Under some circumstances
locations is referred to as "replication" although there are cases in multiple alternative locations may be used simultaneously to provide
which replicated sets of data are not in fact present, and the higher performance access to the file system in question. Provision
replicas are instead different paths to the same data. of such alternate locations is referred to as "replication" although
there are cases in which replicated sets of data are not in fact
present, and the replicas are instead different paths to the same
data.
When a file system is present and becomes absent, clients can be When a file system is present and becomes absent, clients can be
given the opportunity to have continued access to their data, at an given the opportunity to have continued access to their data, at an
alternate location. In this case, a continued attempt to use the alternate location. In this case, a continued attempt to use the
data in the now-absent file system will result in an NFS4ERR_MOVED data in the now-absent file system will result in an NFS4ERR_MOVED
error and at that point the successor locations (typically only one error and at that point the successor locations (typically only one
but multiple choices are possible) can be fetched and used to but multiple choices are possible) can be fetched and used to
continue access. Transfer of the file system contents to the new continue access. Transfer of the file system contents to the new
location is referred to as "migration", but it should be kept in mind location is referred to as "migration", but it should be kept in mind
that there are cases in which this term can be used, like that there are cases in which this term can be used, like
"replication" when there is no actual data migration per se. "replication", when there is no actual data migration per se.
Where a file system was not previously present, specification of file Where a file system was not previously present, specification of file
system location provides a means by which file systems located on one system location provides a means by which file systems located on one
server can be associated with a name space defined by another server, server can be associated with a name space defined by another server,
thus allowing a general multi-server namespace facility. Designation thus allowing a general multi-server namespace facility. Designation
of such a location, in place of an absent file system, is called of such a location, in place of an absent file system, is called
"referral". "referral".
10.4.1. File System Replication 10.4.1. File System Replication
The fs_locations and fs_locations_info attributes provide alternative The fs_locations and fs_locations_info attributes provide alternative
locations, to be used to access data in place of the current file locations, to be used to access data in place of or in addition to
system. On first access to a file system, the client should obtain the current file system instance. On first access to a file system,
the value of the set of alternate locations by interrogating the the client should obtain the value of the set of alternate locations by
fs_locations or fs_locations_info attribute, with the latter being interrogating the fs_locations or fs_locations_info attribute, with
preferred. the latter being preferred.
In the event that server failures, communications problems, or other In the event that server failures, communications problems, or other
difficulties, make continued access to the current file system difficulties, make continued access to the current file system
impossible or otherwise impractical, the client can use the alternate impossible or otherwise impractical, the client can use the alternate
locations as a way to get continued access to his data. locations as a way to get continued access to his data. Depending on
specific attributes of these alternate locations, as indicated within
the fs_locations_info attribute, multiple locations may be used
simultaneously, to provide higher performance through the
exploitation of multiple paths between client and target filesystem.
The alternate locations may be physical replicas of the (typically The alternate locations may be physical replicas of the (typically
read-only) file system data, or they may reflect alternate paths to read-only) file system data, or they may reflect alternate paths to
the same server or provide for the use of various forms of server the same server or provide for the use of various forms of server
clustering in which multiple servers provide alternate ways of clustering in which multiple servers provide alternate ways of
accessing the same physical file system. How these different modes accessing the same physical file system. How these different modes
of file system transition are represented within the fs_locations and of file system transition are represented within the fs_locations and
fs_locations_info attributes and how the client deals with file fs_locations_info attributes and how the client deals with file
system transition issues will be discussed in detail below. system transition issues will be discussed in detail below.
When multiple server addresses correspond to the same actual server,
as shown by a common so_major_id field within the server_owner4 field
returned by EXCHANGE_ID, the client may assume that for each
filesystem in the namespace of a given server IP, there exist
filesystems at corresponding namespace locations for each of the
other server IP's, even in the absence of explicit listing in
fs_locations and fs_locations_info. Such corresponding file system
locations can be used as alternate locations, just as those
explicitly specified via the fs_locations and fs_locations_info
attributes. Where these specific locations are designated in the
fs_locations_info attribute, the conditions of use specified in this
attribute (e.g. priorities, specification of simultaneous use) may
limit the client's use of these alternate locations.
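As a rough sketch of the inference described above, a client could group server addresses by so_major_id and treat the same namespace path on each peer address as an implied alternate location. The function name, the dictionary shape, and the documentation-range addresses in the usage note are assumptions made for this example.

```python
from collections import defaultdict


def implied_alternates(server_owners, current_addr, fs_path):
    """Return (address, path) pairs implied by a shared so_major_id.

    server_owners -- maps each server address to the so_major_id that
                     EXCHANGE_ID returned for it (hypothetical shape)
    """
    by_major = defaultdict(list)
    for addr, major_id in server_owners.items():
        by_major[major_id].append(addr)
    # Addresses sharing the current address's so_major_id reach the same
    # actual server, so the corresponding path is assumed to exist there.
    peers = by_major[server_owners[current_addr]]
    return [(addr, fs_path) for addr in sorted(peers) if addr != current_addr]
```

For example, with addresses 192.0.2.1 and 192.0.2.2 reporting the same so_major_id and 198.51.100.7 reporting a different one, only 192.0.2.2 is returned as an implied alternate for a file system reached through 192.0.2.1. Any conditions of use given in fs_locations_info would still have to be applied on top of this.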
When multiple replicas exist and are used simultaneously or in
succession by a client, they must designate the same data (with
metadata being the same to the degree indicated by the
fs_locations_info attribute). Where filesystems are writable, a
change made on one instance must be visible on all instances,
immediately upon the earlier of the return of the modifying request
or the visibility of that change on any of the associated replicas.
Where a filesystem is not writable but represents a read-only copy
(possibly periodically updated) of a writable filesystem, similar
requirements apply to the propagation of updates. Any change visible
on the original file system instance must be visible on any replica
before the client transitions access to that replica, so that a
client, in effecting a transition to a replica, never sees a reversion
in filesystem state. The specific means by which this is prevented
varies based on the fs4_status_type reported as part of the fs_status
attribute (see Section 10.11).
10.4.2. File System Migration 10.4.2. File System Migration
When a file system is present and becomes absent, clients can be When a file system is present and becomes absent, clients can be
given the opportunity to have continued access to their data, at an given the opportunity to have continued access to their data, at an
alternate location, as specified by the fs_locations or alternate location, as specified by the fs_locations or
fs_locations_info attribute. Typically, a client will be accessing fs_locations_info attribute. Typically, a client will be accessing
the file system in question, get a an NFS4ERR_MOVED error, and then the file system in question, get an NFS4ERR_MOVED error, and then use
use the fs_locations or fs_locations_info attribute to determine the the fs_locations or fs_locations_info attribute to determine the new
new location of the data. When fs_locations_info is used, additional location of the data. When fs_locations_info is used, additional
information will be available which will define the nature of the information will be available which will define the nature of the
client's handling of the transition to a new server. client's handling of the transition to a new server.
Such migration can be helpful in providing load balancing or general Such migration can be helpful in providing load balancing or general
resource reallocation. The protocol does not specify how the file resource reallocation. The protocol does not specify how the file
system will be moved between servers. It is anticipated that a system will be moved between servers. It is anticipated that a
number of different server-to-server transfer mechanisms might be number of different server-to-server transfer mechanisms might be
used with the choice left to the server implementor. The NFSv4.1 used with the choice left to the server implementer. The NFSv4.1
protocol specifies the method used to communicate the migration event protocol specifies the method used to communicate the migration event
between client and server. between client and server.
The new location may be an alternate communication path to the same The new location may be an alternate communication path to the same
server, or, in the case of various forms of server clustering, server, or, in the case of various forms of server clustering,
another server providing access to the same physical file system. another server providing access to the same physical file system.
The client's responsibilities in dealing with this transition depend The client's responsibilities in dealing with this transition depend
on the specific nature of the new access path and how and whether on the specific nature of the new access path and how and whether
data was in fact migrated. These issues will be discussed in detail data was in fact migrated. These issues will be discussed in detail
below. below.
When multiple server addresses correspond to the same actual server,
as shown by a common value for so_major_id field of the server_owner4
value returned by EXCHANGE_ID, the location or locations may
designate alternate server addresses in the form of specific server
IP addresses, when the filesystem in question is available at those
addresses, and no longer accessible at the original address.
Although a single successor location is typical, multiple locations Although a single successor location is typical, multiple locations
may be provided, together with information that allows priority among may be provided, together with information that allows priority among
the choices to be indicated, via information in the fs_locations_info the choices to be indicated, via information in the fs_locations_info
attribute. Where suitable clustering mechanisms make it possible to attribute. Where suitable clustering mechanisms make it possible to
provide multiple identical file systems or paths to them, this allows provide multiple identical file systems or paths to them, this allows
the client the opportunity to deal with any resource or the client the opportunity to deal with any resource or
communications issues that might limit data availability. communications issues that might limit data availability.
When an alternate location is designated as the target for migration,
it must designate the same data (with metadata being the same to the
degree indicated by the fs_locations_info attribute). Where
filesystems are writable, a change made on the original filesystem
must be visible on all migration targets. Where a filesystem is not
writable but represents a read-only copy (possibly periodically
updated) of a writable filesystem, similar requirements apply to the
propagation of updates. Any change visible in the original
filesystem must already be effected on all migration targets, so that
a client, in effecting a transition to the migration target, never
sees a reversion in filesystem state.
10.4.3. Referrals 10.4.3. Referrals
Referrals provide a way of placing a file system in a location Referrals provide a way of placing a file system in a location
essentially without respect to its physical location on a given essentially without respect to its physical location on a given
server. This allows a single server of a set of servers to present a server. This allows a single server of a set of servers to present a
multi-server namespace that encompasses file systems located on multi-server namespace that encompasses file systems located on
multiple servers. Some likely uses of this include establishment of multiple servers. Some likely uses of this include establishment of
site-wide or organization-wide namespaces, or even knitting such site-wide or organization-wide namespaces, or even knitting such
together into a truly global namespace. together into a truly global namespace.
Referrals occur when a client determines, upon first referencing a Referrals occur when a client determines, upon first referencing a
position in the current namespace, that it is part of a new file position in the current namespace, that it is part of a new file
system and that that file system is absent. When this occurs, system and that that file system is absent. When this occurs,
typically by receiving the error NFS4ERR_MOVED, the actual location typically by receiving the error NFS4ERR_MOVED, the actual location
or locations of the file system can be determined by fetching the or locations of the file system can be determined by fetching the
fs_locations or fs_locations_info attribute. fs_locations or fs_locations_info attribute.
The locations-related attribute may designate a single file system
location or multiple file system locations, to be selected based on
the needs of the client. The server, in the fs_locations_info attribute,
may specify priorities to be associated with various file system
location choices. The server may assign different priorities to
different locations as reported to individual clients, in order to
adapt to client physical location or to effect load balancing. When
both read-only and read-write filesystems are present, some of the
read-only locations may not be absolutely up-to-date (as they would have
to be in the case of replication and migration). Servers may also
specify filesystem locations that include client-substituted variables
so that different clients are referred to different file systems
(with different data contents) based on client attributes such as cpu
architecture.
Use of multi-server namespaces is enabled by NFSv4 but is not Use of multi-server namespaces is enabled by NFSv4 but is not
required. The use of multi-server namespaces and their scope will required. The use of multi-server namespaces and their scope will
depend on the application used, and system administration depend on the applications used, and system administration
preferences. preferences.
Multi-server namespaces can be established by a single server Multi-server namespaces can be established by a single server
providing a large set of referrals to all of the included file providing a large set of referrals to all of the included file
systems. Alternatively, a single multi-server namespace may be systems. Alternatively, a single multi-server namespace may be
administratively segmented with separate referral file systems (on administratively segmented with separate referral file systems (on
separate servers) for each separately-administered section of the separate servers) for each separately-administered section of the
name space. Any segment or the top-level referral file system may name space. Any segment or the top-level referral file system may
use replicated referral file systems for higher availability. use replicated referral file systems for higher availability.
Multi-server namespaces are for the most part uniform, in that the
same data made available to one client at a given location in the
namespace is made available to all clients at that location. There
are, however, facilities provided which allow different clients to be
directed to different sets of data, so as to adapt to such client
characteristics as cpu architecture.
10.5. Additional Client-side Considerations 10.5. Additional Client-side Considerations
When clients make use of servers that implement referrals and When clients make use of servers that implement referrals,
migration, care should be taken so that a user who mounts a given replication, and migration, care should be taken so that a user who
file system that includes a referral or a relocated file system mounts a given file system that includes a referral or a relocated
continues to see a coherent picture of that user-side file system file system continues to see a coherent picture of that user-side file
despite the fact that it contains a number of server-side file system despite the fact that it contains a number of server-side file
systems which may be on different servers. systems which may be on different servers.
One important issue is upward navigation from the root of a server- One important issue is upward navigation from the root of a server-
side file system to its parent (specified as ".." in UNIX). The side file system to its parent (specified as ".." in UNIX). The
client needs to determine when it hits an fsid root going up the client needs to determine when it hits an fsid root going up the file
filetree. When at such a point, and needs to ascend to the parent, tree. When at such a point, and needs to ascend to the parent, it
it must do so locally instead of sending a LOOKUPP call to the must do so locally instead of sending a LOOKUPP call to the server.
server. The LOOKUPP would normally return the ancestor of the target The LOOKUPP would normally return the ancestor of the target file
file system on the target server, which may not be part of the space system on the target server, which may not be part of the space that
that the client mounted. the client mounted.
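The upward-navigation rule can be pictured with a small client-side helper. This is a sketch under stated assumptions: nodes are modelled as (fsid, fileid) pairs, mount_points is a hypothetical table the client keeps of file system roots it has crossed into, and send_lookupp stands in for whatever routine issues LOOKUPP to the server.

```python
def parent_of(node, mount_points, send_lookupp):
    """Resolve ".." for a directory on the client.

    node         -- (fsid, fileid) pair identifying the directory
    mount_points -- maps the root of each server-side file system to the
                    directory that contains it in the client's namespace
    send_lookupp -- callable that sends LOOKUPP to the server
    (All three are hypothetical names for this illustration.)
    """
    if node in mount_points:
        # At an fsid root: ascend locally. A LOOKUPP here would return
        # the ancestor on the target server, which may lie outside the
        # space the client mounted.
        return mount_points[node]
    return send_lookupp(node)
```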
A related issue is upward navigation from named attribute
directories. The named attribute directories are essentially
detached from the namespace and this property should be safely
represented in the client operating environment. LOOKUPP on a named
attribute directory may return the filehandle of the associated file
and conveying this to applications might be unsafe as many
applications expect the parent of a directory to be a directory by
itself. Therefore the client may want to hide the parent of named
attribute directories (represented as ".." in UNIX) or represent the
named attribute directory as its own parent (as typically done for
the filesystem root directory in UNIX).
Another issue concerns refresh of referral locations. When referrals Another issue concerns refresh of referral locations. When referrals
are used extensively, they may change as server configurations are used extensively, they may change as server configurations
change. It is expected that clients will cache information related change. It is expected that clients will cache information related
to traversing referrals so that future client side requests are to traversing referrals so that future client side requests are
resolved locally without server communication. This is usually resolved locally without server communication. This is usually
rooted in client-side name lookup caching. Clients should rooted in client-side name lookup caching. Clients should
periodically purge this data for referral points in order to detect periodically purge this data for referral points in order to detect
changes in location information. When the change attribute changes changes in location information. When the change attribute changes
for directories that hold referral entries or for the referral for directories that hold referral entries or for the referral
entries themselves, clients should consider any associated cached entries themselves, clients should consider any associated cached
referral information to be out of date. referral information to be out of date.
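One way a client might tie cached referral information to the change attribute is sketched below. The class and method names are hypothetical, and periodic purging by age is omitted to keep the example short.

```python
class ReferralCache:
    """Cache referral locations keyed by path, tagged with the change
    attribute observed when they were fetched (illustrative only)."""

    def __init__(self):
        self._entries = {}

    def store(self, path, change, locations):
        self._entries[path] = (change, locations)

    def lookup(self, path, current_change):
        entry = self._entries.get(path)
        if entry is None:
            return None
        change, locations = entry
        if change != current_change:
            # The change attribute moved: treat the cached referral
            # information as out of date and drop it.
            del self._entries[path]
            return None
        return locations
```

A None result means the client should fetch fs_locations or fs_locations_info again before following the referral.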
10.6. Effecting File System Transitions 10.6. Effecting File System Transitions
Transitions between file system instances, whether due to switching Transitions between file system instances, whether due to switching
between replicas upon server unavailability, or in response to a between replicas upon server unavailability, or in response to a
server-initiated migration event are best dealt with together. Even server-initiated migration events are best dealt with together. Even
though the prototypical use cases of replication and migration though the prototypical use cases of replication and migration
contain distinctive sets of features, when all possibilities for contain distinctive sets of features, when all possibilities for
these operations are considered, the underlying unity of these these operations are considered, the underlying unity of these
operations, from the client's point of view is clear, even though for operations, from the client's point of view is clear, even though for
the server pragmatic considerations will normally force different the server pragmatic considerations will normally force different
implementation strategies for planned and unplanned transitions. implementation strategies for planned and unplanned transitions.
A number of methods are possible for servers to replicate data and to A number of methods are possible for servers to replicate data and to
track client state in order to allow clients to transition between track client state in order to allow clients to transition between
file system instances with a minimum of disruption. Such methods file system instances with a minimum of disruption. Such methods
skipping to change at page 170, line 45 skipping to change at page 177, line 37
a greater burden on the client to adapt to the transition. a greater burden on the client to adapt to the transition.
The NFSv4.1 protocol does not impose choices on clients and servers The NFSv4.1 protocol does not impose choices on clients and servers
with regard to that spectrum of transition methods. In fact, there with regard to that spectrum of transition methods. In fact, there
are many valid choices, depending on client and application are many valid choices, depending on client and application
requirements and their interaction with server implementation requirements and their interaction with server implementation
choices. The NFSv4.1 protocol does define the specific choices that choices. The NFSv4.1 protocol does define the specific choices that
can be made, how these choices are communicated to the client and how can be made, how these choices are communicated to the client and how
the client is to deal with any discontinuities. the client is to deal with any discontinuities.
In the sections below references will be made to various possible In the sections below, references will be made to various possible
server implementation choices as a way of illustrating the transition server implementation choices as a way of illustrating the transition
scenarios that clients may deal with. The intent here is not to scenarios that clients may deal with. The intent here is not to
define or limit server implementations but rather to illustrate the define or limit server implementations but rather to illustrate the
range of issues that clients may face. range of issues that clients may face.
In the discussion below, references will be made to a file system In the discussion below, references will be made to a file system
having a particular property or of two file systems (typically the having a particular property or of two file systems (typically the
source and destination) belonging to a common class of any of several source and destination) belonging to a common class of any of several
types. Two file systems that belong to such a class share some types. Two file systems that belong to such a class share some
important aspect of file system behavior that clients may depend upon important aspect of file system behavior that clients may depend upon
skipping to change at page 171, line 28 skipping to change at page 178, line 21
In cases in which one server is expected to accept opaque values from In cases in which one server is expected to accept opaque values from
the client that originated from another server, it is a wise the client that originated from another server, it is a wise
implementation practice for the servers to encode the "opaque" values implementation practice for the servers to encode the "opaque" values
in network byte order. If this is done, servers acting as replicas in network byte order. If this is done, servers acting as replicas
or immigrating file systems will be able to parse values like or immigrating file systems will be able to parse values like
stateids, directory cookies, filehandles, etc. even if their native stateids, directory cookies, filehandles, etc. even if their native
byte order is different from that of other servers cooperating in the byte order is different from that of other servers cooperating in the
replication and migration of the file system. replication and migration of the file system.
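The byte-order advice above can be illustrated with a fixed-layout value packed in network byte order. The layout chosen here (an 8-byte fileid followed by a 4-byte generation number) is hypothetical; real servers define their own contents for stateids, cookies, and filehandles.

```python
import struct

# "!" selects network (big-endian) byte order with no padding.
# The field layout is an assumption made for this example.
_COOKIE = struct.Struct("!QI")


def encode_cookie(fileid, generation):
    """Pack a value that is opaque to the client but parseable by any
    cooperating server, whatever its native byte order."""
    return _COOKIE.pack(fileid, generation)


def decode_cookie(blob):
    """Unpack a value produced by encode_cookie on any server."""
    return _COOKIE.unpack(blob)
```

Because the encoding does not depend on the host, a server with a different native byte order that receives the value from the client recovers the same fields.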
10.6.1. Transparent File System Transitions 10.6.1. File System Transitions and Simultaneous Access
Discussion of transition possibilities will start at the most When a single filesystem may be accessed at multiple locations,
transparent end of the spectrum of possibilities. When there are whether this is because of an indication of file system identity as
multiple paths to a single server, and there are network problems reported by the fs_locations or fs_locations_info attributes or
that force another path to be used, or when a path is to be put out because two file systems instances have corresponding locations on
of service, a replication or migration event may occur without any server addresses which connect to the same server as indicated by a
real replication or migration. Nevertheless, such events fit within common so_major_id field in the server_owner4 field returned by
the same general framework in that there is a transition between file EXCHANGE_ID, the client will, depending on specific circumstances as
system locations, communicated just as other, less transparent discussed below, either:
transitions are communicated.
There are cases of transparent transitions that may happen o Access multiple instances simultaneously, as representing
independent of location information, in that a specific host name, alternate paths to the same data and metadata.
may map to several IP addresses, allowing session trunking to provide
alternate paths. In other cases, however multiple addresses may have
separate location entries for specific file systems to preferentially
direct traffic for those specific file systems to certain server
addresses, subject to planned or unplanned, corresponding to a
nominal replication or migrations event.
The specific details of the transition depend on file system o The client accesses one instance (or set of instances) and then
equivalence class information (as provided by the fs_locations_info transitions to an alternative instance (or set of instances) as a
and fs_locations attributes). result of network issues, server unresponsiveness, or server-
directed migration. The transition may involve changes in
filehandles, fileids, the change attribute, and/or locking state,
depending on the attributes of the source and destination file
system instances, as specified in the fs_locations_info attribute.
o Where the old and new file systems belong to the same _endpoint_ Which of these choices is possible, and how a transition is effected
class, the transition consists of creating a new connection which is governed by equivalence classes of file system instances as
is associated with the existing session to the old server reported by the fs_locations_info attribute, and, for file systems
endpoint. Where a connection cannot be associated with the instances in the same location within multiple single-server
existing session, the target server must be able to recognize the namespace, by the so_major_id field in the server_owner4 returned by
sessionid as invalid and force creation on a new session or a new EXCHANGE_ID.
client id.
o Where the old and new file systems do not belong to the same 10.6.2. Simultaneous Use and Transparent Transitions
_endpoint_ classes, but to the same _server_ class, the transition
consists of creating a new session, associated with the existing
clientid. Where the clientid is stale, the target server must be
able to recognize the clientid as no longer valid and force
creation of a new clientid.
In either of the above cases, the file system may be shown as When two file system instances have the same location within their
belonging to the same _sharing_ class, allowing the alternate respective single-server namespaces and those two server IP addresses
session or connection to be established in advance and used either to return the same so_major_id value in the server_owner4 value returned in
accelerate the file system transition when necessary (avoiding response to EXCHANGE_ID, those file system instances can be treated
connection latency), or to provide higher performance by actively as the same, and either used together simultaneously or serially with
using multiple paths simultaneously. no transition activity required on the part of the client.
When two file systems belong to the same _endpoint_ class, or Whether simultaneous use of the two file system instances is valid is
_sharing_ class, many transition issues are eliminated, and any controlled by whether the fs_locations_info attribute shows the two
information indicating otherwise is ignored as erroneous. instances as having the same _simultaneous-use_ class.
Note that for two such file systems, any information within the
fs_locations_info attribute that indicates the need for special
transition activity, i.e. the appearance of the two file system
instances with different _handle_, _fileid_, _verifier_, _change_
classes, MUST be ignored by the client. The server SHOULD NOT
indicate that these instances belong to different _handle_, _fileid_,
_verifier_, _change_ classes, whether the two instances are shown
belonging to the same _simultaneous-use_ class or not.
Where these conditions do not apply, a non-transparent file system
instance transition is required with the details depending on the
respective _handle_, _fileid_, _verifier_, _change_ classes of the
two file system instances and whether the two servers in question
have the same eir_server_scope value as reported by EXCHANGE_ID.
10.6.2.1. Simultaneous Use of File System Instances
When the conditions above hold, in either of the following two cases,
the client may use the two file system instances simultaneously.
o The fs_locations_info attribute does not contain separate per-IP
address entries for file system instances at the distinct IP
addresses. This includes the case in which the fs_locations_info
attribute is unavailable.
o The fs_locations_info attribute indicates that two file system
instances belong to the same _simultaneous-use_ class.
In this case, the client may use both file system instances
simultaneously, as representations of the same file system, whether
that happens because the two IP addresses connect to the same
physical server or because different servers connect to clustered
file systems and export their data in common. When simultaneous use
is in effect, any change made to one file system instance must be
immediately reflected in the other file system instance(s). Locks
are treated as part of a common lease, associated with a common
clientid. Depending on the details of the server_owner4 returned
by EXCHANGE_ID, the two server instances may be accessed by different
sessions or a single session in common.
10.6.2.2. Transparent File System Transitions
When the conditions above hold and the fs_locations_info attribute
explicitly shows the file system instances for these distinct IP
addresses as belonging to different _simultaneous-use_ classes, the
file system instances should not be used by the client
simultaneously, but rather serially with one being used unless and
until communication difficulties, lack of responsiveness, or an
explicit migration event causes another file system instance (or set
of file system instances sharing a common _simultaneous-use_ class) to
be used.
When a change in file system instance is to be done, the client will
use the same clientid already in effect. If it already has
connections to the new server address, these will be used. Otherwise
new connections to existing sessions or new sessions associated with
the existing clientid are established as indicated by the
server_owner4 returned by EXCHANGE_ID.
In all such transparent transition cases, the following apply: In all such transparent transition cases, the following apply:
o File handles stay the same if persistent and if volatile are only o File handles stay the same if persistent and if volatile are only
subject to expiration, if they would be in the absence of file subject to expiration, if they would be in the absence of file
system transition. system transition.
o Fileid values do not change across the transition. o Fileid values do not change across the transition.
o The file system will have the same fsid in both the old and new o The file system will have the same fsid in both the old and new
the old and new locations. locations.
o Change attribute values are consistent across the transition and o Change attribute values are consistent across the transition and
do not have to be refetched. When change attributes indicate that do not have to be refetched. When change attributes indicate that
a cached object is still valid, it can remain cached. a cached object is still valid, it can remain cached.
o Session, client, and state identifier retain their validity across o Client, and state identifier retain their validity across the
the transition, except where their staleness is recognized and transition, except where their staleness is recognized and
reported by the new server. Except where such staleness requires reported by the new server. Except where such staleness requires
it, no lock reclamation is needed. it, no lock reclamation is needed.
o Write verifiers are presumed to retain their validity and can be o Write verifiers are presumed to retain their validity and can be
presented to COMMIT, with the expectation that if COMMIT on the presented to COMMIT, with the expectation that if COMMIT on the
new server accepts them as valid, then that server has all of the new server accepts them as valid, then that server has all of the
data unstably written to the original server and has committed it data unstably written to the original server and has committed it
to stable storage as requested. to stable storage as requested.
10.6.2. Filehandles and File System Transitions 10.6.3. Filehandles and File System Transitions
There are a number of ways in which filehandles can be handled across There are a number of ways in which filehandles can be handled across
a file system transition. These can be divided into two broad a file system transition. These can be divided into two broad
classes depending upon whether the two file systems across which the classes depending upon whether the two file systems across which the
transition happens share sufficient state to effect some sort of transition happens share sufficient state to effect some sort of
continuity of file system handling. continuity of file system handling.
When there is no such co-operation in filehandle assignment, the two When there is no such co-operation in filehandle assignment, the two
file systems are reported as being in different _handle_ classes. In file systems are reported as being in different _handle_ classes. In
this case, all filehandles are assumed to expire as part of the file this case, all filehandles are assumed to expire as part of the file
skipping to change at page 173, line 31 skipping to change at page 181, line 29
FH4_VOL_MIGRATION bit, which only affects behavior when FH4_VOL_MIGRATION bit, which only affects behavior when
fs_locations_info is not available. fs_locations_info is not available.
When there is co-operation in filehandle assignment, the two file When there is co-operation in filehandle assignment, the two file
systems are reported as being in the same _handle_ classes. In this systems are reported as being in the same _handle_ classes. In this
case, persistent filehandles remain valid after the file system case, persistent filehandles remain valid after the file system
transition, while volatile filehandles (excluding those which are transition, while volatile filehandles (excluding those which are
only volatile due to the FH4_VOL_MIGRATION bit) are subject to only volatile due to the FH4_VOL_MIGRATION bit) are subject to
expiration on the target server. expiration on the target server.
10.6.3. Fileid's and File System Transitions 10.6.4. Fileid's and File System Transitions
In NFSv4.0, the issue of continuity of fileid's in the event of a In NFSv4.0, the issue of continuity of fileid's in the event of a
file system transition was not addressed. The general expectation file system transition was not addressed. The general expectation
had been that in situations in which the two file system instances had been that in situations in which the two file system instances
are created by a single vendor using some sort of file system image are created by a single vendor using some sort of file system image
copy, fileid's will be consistent across the transition while in the copy, fileid's will be consistent across the transition while in the
analogous multi-vendor transitions they will not. This poses analogous multi-vendor transitions they will not. This poses
difficulties, especially for the client without special knowledge of difficulties, especially for the client without special knowledge of
the transition mechanisms adopted by the server. the transition mechanisms adopted by the server.
skipping to change at page 174, line 25 skipping to change at page 182, line 23
value across a migration event, allowing a truly transparent value across a migration event, allowing a truly transparent
migration event. migration event.
In any case, where servers can provide continuity of fileids, they In any case, where servers can provide continuity of fileids, they
should and the client should be able to find out that such continuity should and the client should be able to find out that such continuity
is available, and take appropriate action. Information about the is available, and take appropriate action. Information about the
continuity (or lack thereof) of fileid's across a file system transition is continuity (or lack thereof) of fileid's across a file system transition is
represented by specifying whether the file systems in question are of represented by specifying whether the file systems in question are of
the same _fileid_ class. the same _fileid_ class.
10.6.4. Fsid's and File System Transitions 10.6.5. Fsid's and File System Transitions
Since fsid's are only unique on a per-server basis, it is to be Since fsid's are only unique on a per-server basis, it is to be
expected that they will change during a file system transition. expected that they will change during a file system transition.
Clients should not make the fsid's received from the server visible Clients should not make the fsid's received from the server visible
to applications since they may not be globally unique, and because to applications since they may not be globally unique, and because
they may change during a file system transition event. Applications they may change during a file system transition event. Applications
are best served if they are isolated from such transitions to the are best served if they are isolated from such transitions to the
extent possible. extent possible.
10.6.5. The Change Attribute and File System Transitions When a file system transition is made and the fs_locations_info
indicates that the file system in question may be split into multiple
file systems (via the LIF_MULTI_FS flag), the client should do GETATTR's
on all known objects within the file system undergoing transition, to
determine the new file system boundaries. Clients may maintain the
fsid's passed to existing applications by mapping all of the fsid's for
the descendant file systems to the common fsid used for the
original file system.
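
The fsid mapping described above might be sketched as follows, in Python. This is an illustrative assumption about client-side bookkeeping, not something the draft specifies: the class name FsidMapper and its methods are hypothetical, and an fsid is modeled as a (major, minor) tuple.

```python
class FsidMapper:
    """Keeps the fsid shown to applications stable after a split.

    Hypothetical client-side helper: fsids are (major, minor) tuples.
    """

    def __init__(self, original_fsid):
        # The fsid applications saw before the transition.
        self.original_fsid = original_fsid
        # fsids discovered via GETATTR after the transition that
        # correspond to pieces of the original file system.
        self.descendants = set()

    def note_descendant(self, new_fsid):
        """Record an fsid found on an object of the original file system."""
        self.descendants.add(new_fsid)

    def visible_fsid(self, server_fsid):
        """Return the fsid to expose to applications."""
        if server_fsid in self.descendants:
            return self.original_fsid
        return server_fsid
```

After the client has issued GETATTR on its known objects and called note_descendant for each new fsid it finds, visible_fsid returns the original fsid for all of them, while fsids of unrelated file systems pass through unchanged.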
10.6.6. The Change Attribute and File System Transitions
Since the change attribute is defined as a server-specific one, Since the change attribute is defined as a server-specific one,
change attributes fetched from one server are normally presumed to be change attributes fetched from one server are normally presumed to be
invalid on another server. Such a presumption is troublesome since invalid on another server. Such a presumption is troublesome since
it would invalidate all cached change attributes, requiring it would invalidate all cached change attributes, requiring
refetching. Even more disruptive, the absence of any assured refetching. Even more disruptive, the absence of any assured
continuity for the change attribute means that even if the same value continuity for the change attribute means that even if the same value
is gotten on refetch no conclusions can drawn as to whether the is gotten on refetch no conclusions can drawn as to whether the
object in question has changed. The identical change attribute could object in question has changed. The identical change attribute could
be merely an artifact, of a modified file with a different change be merely an artifact, of a modified file with a different change
attribute construction algorithm, with that new algorithm just attribute construction algorithm, with that new algorithm just
happening to result in an identical change value. happening to result in an identical change value.
When the two file systems have consistent change attribute formats, When the two file systems have consistent change attribute formats,
and this fact is communicated to the client by reporting as in the and this fact is communicated to the client by reporting as in the
same _change_ class, the client may assume a continuity of change same _change_ class, the client may assume a continuity of change
attribute construction and handle this situation just as it would be attribute construction and handle this situation just as it would be
handled without any file system transition. handled without any file system transition.
10.6.6. Lock State and File System Transitions 10.6.7. Lock State and File System Transitions
In a file system transition, the two file systems may have co- In a file system transition, the client needs to handle cases in
operated in state management. When this is the case, and the two which the two servers have cooperated in state management and in
file systems belong to the same _state_ class, the two file systems which they have not. Cooperation by two servers in state management
will have compatible state environments. In the case of migration, requires coordination of clientids. Before the client attempts to
the servers involved in the migration of a file system SHOULD use a clientid associated with one server in a request to the server
transfer all server state from the original to the new server. When of the other file system, it must eliminate the possibility that two
this done, it must be done in a way that is transparent to the non-cooperating servers have assigned the same clientid by accident.
client. With replication, such a degree of common state is typically The client needs to compare the eir_server_scope values returned by
not the case. Clients, however should use the information provided each server. If the scope values do not match, then the servers have
by the fs_locations_info attribute to determine whether such sharing not cooperated in state management. If the scope values match, then
is in effect when this is available, and only if that attribute is this indicates the servers have cooperated in assigning clientids to
not available depend on these defaults. the point that they will reject clientids that refer to state they do
not know about.
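
The scope comparison described above can be sketched as a small decision function, in Python. The names NextStep and next_step are hypothetical; the scope arguments stand for the eir_server_scope values returned by EXCHANGE_ID on the old and new servers, treated here as opaque byte strings.

```python
from enum import Enum


class NextStep(Enum):
    # Scopes match: the existing clientid may be presented, and the
    # new server will either accept it or reject it as stale.
    TRY_EXISTING_CLIENTID = "try existing clientid"
    # Scopes differ: the clientid could collide by accident, so it
    # must not be presented to the new server.
    ESTABLISH_NEW_CLIENTID = "establish new clientid"


def next_step(old_server_scope: bytes, new_server_scope: bytes) -> NextStep:
    """Decide how to approach the new server based on server scope."""
    if old_server_scope == new_server_scope:
        return NextStep.TRY_EXISTING_CLIENTID
    return NextStep.ESTABLISH_NEW_CLIENTID
```

Note that a scope match only permits the attempt; the new server may still reject the clientid, in which case locks have to be reclaimed.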
In the case of migration, the servers involved in the migration of a
file system SHOULD transfer all server state from the original to the
new server. When this is done, it must be done in a way that is
transparent to the client. With replication, such a degree of common
state is typically not the case. Clients, however, should use the
information provided by the eir_server_scope returned by EXCHANGE_ID
to determine whether such sharing may be in effect, rather than
making assumptions based on the reason for the transition.
This state transfer will reduce disruption to the client when a file This state transfer will reduce disruption to the client when a file
system transition occurs. If the servers are successful in transferring all system transition occurs. If the servers are successful in transferring all
state, the client will continue to use stateids assigned by the state, the client can attempt to establish sessions associated with
original server. Therefore the new server must recognize these the client id used for the source file system instance. If the
stateids as valid. This holds true for the clientid as well. Since server accepts that as a valid clientid, then the client may use the
responsibility for an entire file system is transferred with such existing stateid's associated with that clientid for the old file
an event, there is no possibility that conflicts will arise on the system instance in connection with that same clientid in
new server as a result of the transfer of locks. connection with the file system instance.
As part of the transfer of information between servers, leases would
be transferred as well. The leases being transferred to the new
server will typically have a different expiration time from those for
the same client, previously on the old server. To maintain the
property that all leases on a given server for a given client expire
at the same time, the server should advance the expiration time to
the later of the leases being transferred or the leases already
present. This allows the client to maintain lease renewal of both
classes without special effort.
When the two servers belong to the same _state_ class, it does not When the two servers belong to the same server scope, it does not
necessarily mean that when dealing with the transition, the client necessarily mean that when dealing with the transition, the client
will not have to reclaim state. However it does mean that the client will not have to reclaim state. However it does mean that the client
may proceed using his current clientid and stateid's just as if there may proceed using his current clientid when establishing
had been no file system transition event and only reclaim state when communication with the new server and that that new server will
an NFS4ERR_STALE_CLIENTID or NFS4ERR_STALE_STATEID error is received. either recognize that clientid as valid, or reject it, in which case
locks must be reclaimed by the client.
File systems co-operating in state management may actually share File systems co-operating in state management may actually share
state or simply divide the id space so as to recognize (and reject as state or simply divide the id space so as to recognize (and reject as
stale) each other's state and client id's. Servers which do share stale) each other's state and client id's. Servers which do share
state may not do under all conditions or all times. The requirement state may not do so under all conditions or at all times. The
for the server is that if it cannot be sure in accepting an id that requirement for the server is that if it cannot be sure in accepting
it reflects the locks the client was given, it must treat all a clientid that it reflects the locks the client was given, it must
associated state as stale and report it as such to the client. treat all associated state as stale and report it as such to the
client.
When two file systems belong to different _state_ classes, the client When the two file system instances are on servers that do not share
must establish a new state on the destination, and reclaim if a server scope value, the client must establish a new clientid on the
possible. In this case, old stateids and clientid's should not be destination, if it does not have one already, and reclaim if possible.
presented to the new server since there is no assurance that they In this case, old stateids and clientid's should not be presented to
will not conflict with id's valid on that server. the new server since there is no assurance that they will not
conflict with id's valid on that server.
In either case, when actual locks are not known to be maintained, the In either case, when actual locks are not known to be maintained, the
destination server may establish a grace period specific to the given destination server may establish a grace period specific to the given
file system, with non-reclaim locks being rejected for that file file system, with non-reclaim locks being rejected for that file
system, even though normal locks are being granted for other file system, even though normal locks are being granted for other file
systems. Clients should not infer the absence of a grace period for systems. Clients should not infer the absence of a grace period for
file systems being transitioned to a server from responses to file systems being transitioned to a server from responses to
requests for other file systems. requests for other file systems.
In the case of lock reclamation for a given file system after a file In the case of lock reclamation for a given file system after a file
skipping to change at page 176, line 38 skipping to change at page 184, line 48
incorrectly granted, the destination server should not establish a incorrectly granted, the destination server should not establish a
file-system-specific grace period. file-system-specific grace period.
In place of a file-system-specific version of RECLAIM_COMPLETE, In place of a file-system-specific version of RECLAIM_COMPLETE,
servers may assume that an attempt to obtain a new lock, other than servers may assume that an attempt to obtain a new lock, other than
by reclaim, indicates the end of the client's attempt to reclaim locks by reclaim, indicates the end of the client's attempt to reclaim locks
for that file system. [NOTE: The alternative would be to adapt for that file system. [NOTE: The alternative would be to adapt
RECLAIM_COMPLETE to this task]. RECLAIM_COMPLETE to this task].
Information about client identity may be propagated between Information about client identity may be propagated between
servers in the form of nfs_client_id4 and associated verifiers, under servers in the form of client_owner4 and associated verifiers, under
the assumption that the client presents the same values to all the the assumption that the client presents the same values to all the
servers with which it deals. [NOTE: This contradicts what is servers with which it deals.
currently said about SETCLIENTID, and interacts with the issue of
what sessions should do about this.]
Servers are encouraged to provide facilities to allow locks to be Servers are encouraged to provide facilities to allow locks to be
reclaimed on the new server after a file system transition. Often, reclaimed on the new server after a file system transition. Often,
however, in cases in which the two file systems are not of the same however, in cases in which the two servers do not share a server
_state _ class, such facilities may not be available and client scope value, such facilities may not be available and client should
should be prepared to re-obtain locks, even though it is possible be prepared to re-obtain locks, even though it is possible that the
that the client may have his LOCK or OPEN request denied due to a client may have his LOCK or OPEN request denied due to a conflicting
conflicting lock. In some environments, such as the transition lock. In some environments, such as the transition between read-only
between read-only file systems, such denial of locks should not pose file systems, such denial of locks should not pose large difficulties
large difficulties in practice. When an attempt to re-establish a in practice. When an attempt to re-establish a lock on a new server
lock on a new server is denied, the client should treat the situation is denied, the client should treat the situation as if his original
as if his original lock had been revoked. In all cases in which the lock had been revoked. In all cases in which the lock is granted,
lock is granted, the client cannot assume that no conflicting lock could the client cannot assume that no conflicting lock could have been granted
have been granted in the interim. Where change attribute continuity in the interim. Where change attribute continuity is present, the
is present, the client may check the change attribute to check for client may check the change attribute to check for unwanted file
unwanted file modifications. Where even this is not available, and modifications. Where even this is not available, and the file system
the file system is not read-only a client may reasonably treat all is not read-only, a client may reasonably treat all pending locks as
pending locks as having been revoked. having been revoked.
10.6.6.1. Leases and File System Transitions 10.6.7.1. Leases and File System Transitions
In the case of lease renewal, the client may not be submitting In the case of lease renewal, the client may not be submitting
requests for a file system that has been transferred to another requests for a file system that has been transferred to another
server. This can occur because of the lease renewal mechanism. The server. This can occur because of the lease renewal mechanism. The
client renews leases for all file systems when submitting a request client renews leases for all file systems when submitting a request
to any one file system at the server. on an associated session, regardless of the specific file system
being referenced.
In order for the client to schedule renewal of leases that may have In order for the client to schedule renewal of leases that may have
been relocated to the new server, the client must find out about been relocated to the new server, the client must find out about
lease relocation before those leases expire. To accomplish this, all lease relocation before those leases expire. To accomplish this, the
operations which renew leases for a client (i.e. OPEN, CLOSE, READ, SEQUENCE operation will return the status bit
WRITE, RENEW, LOCK, LOCKT, LOCKU), will return the error SEQ4_STATUS_LEASE_MOVED, if responsibility for any of the leases to
NFS4ERR_LEASE_MOVED if responsibility for any of the leases to be be renewed has been transferred to a new server. This condition will
renewed has been transferred to a new server. This condition will
continue until the client receives an NFS4ERR_MOVED error and the continue until the client receives an NFS4ERR_MOVED error and the
server receives the subsequent GETATTR for the fs_locations or server receives the subsequent GETATTR for the fs_locations or
fs_locations_info attribute for an access to each file system for fs_locations_info attribute for an access to each file system for
which a lease has been moved to a new server. which a lease has been moved to a new server.
[ISSUE: There is a conflict between this and the idea in the sessions When a client receives an SEQ4_STATUS_LEASE_MOVED indication, it
text that we can have every op in the session implicitly renew the should perform an operation on each file system associated with the
lease. This needs to be dealt with. D. Noveck will create an issue server in question. When the client receives an NFS4ERR_MOVED error,
in the issue tracker.] the client can follow the normal process to obtain the new server
When a client receives an NFS4ERR_LEASE_MOVED error, it should
perform an operation on each file system associated with the server
in question. When the client receives an NFS4ERR_MOVED error, the
client can follow the normal process to obtain the new server
information (through the fs_locations and fs_locations_info information (through the fs_locations and fs_locations_info
attributes) and perform renewal of those leases on the new server, attributes) and perform renewal of those leases on the new server,
unless information in fs_locations_info attribute shows that no state unless information in fs_locations_info attribute shows that no state
could have been transferred. If the server has not had state could have been transferred. If the server has not had state
transferred to it transparently, the client will receive either transferred to it transparently, the client will receive either
NFS4ERR_STALE_CLIENTID or NFS4ERR_STALE_STATEID from the new server, NFS4ERR_STALE_CLIENTID from the new server, as described above, and
as described above, and the client can then recover state information the client can then reclaim locks as is done in the event of server
as it does in the event of server failure. failure.
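
The client-side handling of SEQ4_STATUS_LEASE_MOVED described above might look like the following minimal Python sketch. The numeric value of SEQ4_STATUS_LEASE_MOVED is an assumption taken from the later published protocol definition and may differ in this draft. The function name and the probe callback are hypothetical; probe stands for any operation the client performs on a file system, returning an NFSv4 status code.

```python
NFS4_OK = 0
NFS4ERR_MOVED = 10019
# Assumed bit value; check the XDR definition for the revision in use.
SEQ4_STATUS_LEASE_MOVED = 0x00000080


def file_systems_to_relocate(sr_status_flags, file_systems, probe):
    """Return the file systems whose leases have moved.

    sr_status_flags: status flags from a SEQUENCE reply.
    file_systems:    file systems the client uses on this server.
    probe:           callable issuing an operation on one file system
                     and returning its status code (hypothetical).
    """
    if not sr_status_flags & SEQ4_STATUS_LEASE_MOVED:
        return []
    # Touch every file system; those answering NFS4ERR_MOVED need
    # fs_locations or fs_locations_info fetched and their leases
    # renewed on the new server.
    return [fs for fs in file_systems if probe(fs) == NFS4ERR_MOVED]
```

For each file system returned, the client would then fetch the location attributes and proceed with renewal or reclaim on the new server as described in the text.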
10.6.6.2. Transitions and the Lease_time Attribute 10.6.7.2. Transitions and the Lease_time Attribute
In order that the client may appropriately manage its leases in the In order that the client may appropriately manage its leases in the
case of a file system transition, the destination server must case of a file system transition, the destination server must
establish proper values for the lease_time attribute. establish proper values for the lease_time attribute.
When state is transferred transparently, that state should include When state is transferred transparently, that state should include
the correct value of the lease_time attribute. The lease_time the correct value of the lease_time attribute. The lease_time
attribute on the destination server must never be less than that on attribute on the destination server must never be less than that on
the source since this would result in premature expiration of leases the source since this would result in premature expiration of leases
granted by the source server. Upon transitions in which state is granted by the source server. Upon transitions in which state is
transferred transparently, the client is under no obligation to re- transferred transparently, the client is under no obligation to re-
fetch the lease_time attribute and may continue to use the value fetch the lease_time attribute and may continue to use the value
previously fetched (on the source server). previously fetched (on the source server).
If state has not been transferred transparently, either because the If state has not been transferred transparently, either because the
file systems are shown as being in different state classes or because associated servers are shown as having different eir_server_scope
the client sees a real or simulated server reboot), the client should strings or because the clientid is rejected when presented to the new
fetch the value of lease_time on the new (i.e. destination) server, server, the client should fetch the value of lease_time on the new
and use it for subsequent locking requests. However the server must (i.e. destination) server, and use it for subsequent locking
respect a grace period at least as long as the lease_time on the requests. However the server must respect a grace period at least as
source server, in order to ensure that clients have ample time to long as the lease_time on the source server, in order to ensure that
reclaim their locks before potentially conflicting non-reclaimed locks clients have ample time to reclaim their locks before potentially
are granted. conflicting non-reclaimed locks are granted.
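The lease_time rules above can be sketched as follows. This is a minimal illustration only: the function names are hypothetical, and taking the maximum of the two lease times as the grace-period floor assumes the destination's grace period is normally at least its own lease_time.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical helper names; not part of the protocol definition.

// Client side: lease_time (seconds) to use after a file system transition.
//   stateTransferred: true when state moved transparently to the destination.
//   cachedLease:      value previously fetched on the source server.
//   destLease:        value fetched on the destination server.
uint32_t clientLeaseAfterTransition(bool stateTransferred,
                                    uint32_t cachedLease,
                                    uint32_t destLease) {
    // With transparent transfer the client may keep the cached value,
    // since the destination must never advertise a smaller lease_time.
    return stateTransferred ? cachedLease : destLease;
}

// Server side: shortest grace period the destination may honor, so that
// clients of the source have time to reclaim locks (assumed floor).
uint32_t minimumGracePeriod(uint32_t sourceLease, uint32_t destLease) {
    return std::max(sourceLease, destLease);
}
```

A client that sees its clientid rejected by the new server would call clientLeaseAfterTransition with stateTransferred set to false after fetching lease_time from the destination.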
10.6.7. Write Verifiers and File System Transitions 10.6.8. Write Verifiers and File System Transitions
In a file system transition, the two file systems may be clustered in In a file system transition, the two file systems may be clustered in
the handling of unstably written data. When this is the case, and the handling of unstably written data. When this is the case, and
the two file systems belong to the same _verifier_ class, valid the two file systems belong to the same _verifier_ class, valid
verifiers from one system may be recognized by the other and verifiers from one system may be recognized by the other and
superfluous writes avoided. There is no requirement that all valid superfluous writes avoided. There is no requirement that all valid
verifiers be recognized, but it cannot be the case that a verifier is verifiers be recognized, but it cannot be the case that a verifier is
recognized as valid when it is not. [NOTE: We need to resolve the recognized as valid when it is not. [NOTE: We need to resolve the
issue of proper verifier scope]. issue of proper verifier scope].
skipping to change at page 186, line 47 skipping to change at page 195, line 5
Since the fs_locations attribute lacks information defining various Since the fs_locations attribute lacks information defining various
attributes of the various file system choices presented, it should attributes of the various file system choices presented, it should
only be interrogated and used when fs_locations_info is not only be interrogated and used when fs_locations_info is not
available. When fs_locations is used, information about the specific available. When fs_locations is used, information about the specific
locations should be assumed based on the following rules. locations should be assumed based on the following rules.
The following rules are general and apply irrespective of the The following rules are general and apply irrespective of the
context. context.
o When a DNS server name maps to multiple IP addresses, they should o All listed file system instances should be considered as of the
be considered identical, i.e. of the same _endpoint_ class. same _handle_ class, if and only if, the current fh_expire_type
attribute does not include the FH4_VOL_MIGRATION bit. Note that
o Except in the case of servers sharing an _endpoint_ class, all in the case of referral, filehandle issues do not apply since
listed servers should be considered as of the same _handle_ class, there can be no filehandles known within the current file system
if and only if, the current fh_expire_type attribute does not nor is there any access to the fh_expire_type attribute on the
include the FH4_VOL_MIGRATION bit. Note that in the case of referring (absent) file system.
referral, filehandle issues do not apply since there can be no
filehandles known within the current file system nor is there any
access to the fh_expire_type attribute on the referring (absent)
file system.
o Except in the case of servers sharing an _endpoint_ class, all o All listed file system instances should be considered as of the
listed servers should be considered as of the same _fileid_ class, same _fileid_ class, if and only if, the fh_expire_type attribute
if and only if, the fh_expire_type attribute indicates persistent indicates persistent filehandles and does not include the
filehandles and does not include the FH4_VOL_MIGRATION bit. Note FH4_VOL_MIGRATION bit. Note that in the case of referral, fileid
that in the case of referral, fileid issues do not apply since issues do not apply since there can be no fileids known within the
there can be no fileids known within the referring (absent) file referring (absent) file system nor is there any access to the
system nor is there any access to the fh_expire_type attribute. fh_expire_type attribute.
o Except in the case of servers sharing an _endpoint_ class, all o All listed file system instances should be considered as of
listed servers should be considered as of different _change_ different _change_ classes.
classes.
For other class assignments, handling of file system For other class assignments, handling of file system
transitions depends on the reasons for the transition: transitions depends on the reasons for the transition:
o When the transition is due to migration, the target should be o When the transition is due to migration, the target should be
treated as being of the same _state_ and _verifier_ class as the treated as being of the same _verifier_ class as the source.
source.
o When the transition is due to failover to another replica, the o When the transition is due to failover to another replica, the
target should be treated as being of a different _state_ and target should be treated as being of a different _verifier_ class
_verifier_ class from the source. from the source.
The specific choices reflect typical implementation patterns for The specific choices reflect typical implementation patterns for
failover and controlled migration respectively. Since other choices failover and controlled migration respectively. Since other choices
are possible and useful, this information is better obtained by using are possible and useful, this information is better obtained by using
fs_locations_info. fs_locations_info.
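As a rough illustration, the default assumptions a client makes when only fs_locations is available might be coded as below. The sketch follows the right-hand (-08) column of the rules. The AssumedClasses structure, the Transition enumeration and the function name are hypothetical, and "persistent filehandles" is interpreted here as "no volatile bit set in fh_expire_type", which is an assumption of this sketch.

```cpp
#include <cstdint>

// Volatile-filehandle bits of the fh_expire_type attribute.
constexpr uint32_t FH4_VOLATILE_ANY  = 0x00000002;
constexpr uint32_t FH4_VOL_MIGRATION = 0x00000004;
constexpr uint32_t FH4_VOL_RENAME    = 0x00000008;

// Hypothetical: why the client is moving to another instance.
enum class Transition { Migration, ReplicaFailover };

// Hypothetical: whether source and target are assumed to share each class.
struct AssumedClasses {
    bool sameHandle;
    bool sameFileid;
    bool sameChange;
    bool sameVerifier;
};

AssumedClasses assumeClasses(uint32_t fhExpireType, Transition reason) {
    const bool volMigration = (fhExpireType & FH4_VOL_MIGRATION) != 0;
    const bool persistent =
        (fhExpireType &
         (FH4_VOLATILE_ANY | FH4_VOL_MIGRATION | FH4_VOL_RENAME)) == 0;

    AssumedClasses result;
    result.sameHandle   = !volMigration;                 // handle class rule
    result.sameFileid   = persistent && !volMigration;   // fileid class rule
    result.sameChange   = false;                         // always different
    result.sameVerifier = (reason == Transition::Migration);
    return result;
}
```

In a referral there are no filehandles or fileids on the referring file system, so a client would not need to consult the handle and fileid results at all in that case.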
See the section "Security Considerations" for a discussion on the See the section "Security Considerations" for a discussion on the
recommendations for the security flavor to be used by any GETATTR recommendations for the security flavor to be used by any GETATTR
operation that requests the "fs_locations" attribute. operation that requests the "fs_locations" attribute.
skipping to change at page 189, line 7 skipping to change at page 197, line 7
o Server-derived preference information for replicas, which can be o Server-derived preference information for replicas, which can be
used to implement load-balancing while giving the client the used to implement load-balancing while giving the client the
entire fs list to be used in case the primary fails. entire fs list to be used in case the primary fails.
The fs_locations_info attribute consists of a root pathname (just The fs_locations_info attribute consists of a root pathname (just
like fs_locations), together with an array of location4_item like fs_locations), together with an array of location4_item
structures. structures.
struct locations4_server { struct locations4_server {
int32_t currency; int32_t currency;
uint32_t info<>; opaque info<>;
utf8str_cis server; utf8str_cis server;
}; };
const LIBX_GFLAGS = 0; const LIBX_GFLAGS = 0;
const LIBX_TFLAGS = 1; const LIBX_TFLAGS = 1;
const LIBX_CLSHARE = 2; const LIBX_CLSIMUL = 2;
const LIBX_CLSERVER = 3; const LIBX_CLHANDLE = 3;
const LIBX_CLENDPOINT = 4; const LIBX_CLFILEID = 4;
const LIBX_CLHANDLE = 5; const LIBX_CLVERIFIER = 5;
const LIBX_CLFILEID = 6; const LIBX_CLCHANGE = 6;
const LIBX_CLVERIFIER = 7;
const LIBX_CLSTATE = 8;
const LIBX_READRANK = 9; const LIBX_READRANK = 7;
const LIBX_WRITERANK = 10; const LIBX_WRITERANK = 8;
const LIBX_READORDER = 11; const LIBX_READORDER = 9;
const LIBX_WRITEORDER = 12; const LIBX_WRITEORDER = 10;
const LIGF_WRITABLE = 0x01; const LIGF_WRITABLE = 0x01;
const LIGF_CUR_REQ = 0x02; const LIGF_CUR_REQ = 0x02;
const LIGF_ABSENT = 0x04; const LIGF_ABSENT = 0x04;
const LIGF_GOING = 0x08; const LIGF_GOING = 0x08;
const LIGF_SPLIT = 0x10;
const LITF_RDMA = 0x01; const LITF_RDMA = 0x01;
struct locations4_item { struct locations4_item {
locations4_server entries<>; locations4_server entries<>;
pathname4 rootpath; pathname4 rootpath;
}; };
struct locations4_info { struct locations4_info {
uint32_t info_flags;
pathname4 fs_root; pathname4 fs_root;
locations4_item items<>; locations4_item items<>;
}; };
const LIIF_VAR_SUB = 0x00000001;
The fs_locations_info attribute is structured similarly to the The fs_locations_info attribute is structured similarly to the
fs_locations attribute. A top-level structure (fs_locations4 or fs_locations attribute. A top-level structure (fs_locations4 or
locations4_info) contains the entire attribute including the root locations4_info) contains the entire attribute including the root
pathname of the fs and an array of lower-level structures that define pathname of the fs and an array of lower-level structures that define
replicas that share a common root path on their respective servers. replicas that share a common root path on their respective servers.
Those lower-level structures in turn (fs_locations4 or Those lower-level structures in turn (fs_locations4 or
location4_item) contain a specific pathname and information on one or location4_item) contain a specific pathname and information on one or
more individual server replicas. For that last lowest-level more individual server replicas. For that last lowest-level
information, fs_locations has a server name in the form of information, fs_locations has a server name in the form of
utf8str_cis, while fs_locations_info has a location4_server structure utf8str_cis, while fs_locations_info has a location4_server structure
that contains per-server-replica information in addition to the that contains per-server-replica information in addition to the
server name. server name.
As noted above, the fs_locations_info attribute, when supported, may
be requested of absent file systems without causing NFS4ERR_MOVED to
be returned and it is generally expected that it will be available
for both present and absent file systems even if only a single
location_server entry is present, designating the current (present)
file system, or two location_server entries designating the current
(and now previous) location of an absent file system and its
successor location. Servers are strongly urged to support this
attribute on all file systems if they support it on any file system.
10.10.1. The location4_server Structure
The location4_server structure consists of the following items: The location4_server structure consists of the following items:
o An indication of file system up-to-date-ness (currency) in terms o An indication of file system up-to-date-ness (currency) in terms
of approximate seconds before the present. A negative value of approximate seconds before the present. A negative value
indicates that the server is unable to give any reasonably useful indicates that the server is unable to give any reasonably useful
value here. A zero indicates that file system is the actual value here. A zero indicates that file system is the actual
writable data or a reliably coherent and fully up-to-date copy. writable data or a reliably coherent and fully up-to-date copy.
Positive values indicate how out-of-date this copy can normally Positive values indicate how out-of-date this copy can normally
be before it is considered for update. Such a value is not a be before it is considered for update. Such a value is not a
guarantee that such updates will always be performed on the guarantee that such updates will always be performed on the
required schedule but instead serve as a hint about how far behind required schedule but instead serve as a hint about how far behind
the most up-to-date copy of the data, this copy would normally be the most up-to-date copy of the data, this copy would normally be
expected to be. expected to be.
o A counted array of 32-bit words containing various sorts of data, o A counted array of one-byte values containing various sorts of
about the particular file system instance. This data includes data, about the particular file system instance. This data
general flags, transport capability flags, file system equivalence includes general flags, transport capability flags, file system
class information, and selection priority information. The equivalence class information, and selection priority information.
encoding will be discussed below. The encoding will be discussed below.
o The server string. For the case of the replica currently being o The server string. For the case of the replica currently being
accessed (via GETATTR), a null string may be used to indicate the accessed (via GETATTR), a null string may be used to indicate the
current address being used for the RPC call. current address being used for the RPC call.
Data within the info array, is in the form of 8-bit data items even Data within the info array, is in the form of 8-bit data items with
though that array is, from XDR's point of view an array of 32-bit constants giving the offsets within the array of various values
integers. This definition was chosen because: describing this particular file system instance. This style of
definition was chosen, in preference to explicit XDR structure
definitions for these values for a number of reasons.
o The kinds of data in the info array, representing flags, file o The kinds of data in the info array, representing flags, file
system classes, and priorities among sets of file systems system classes, and priorities among sets of file systems
representing the same data are such that eight bits provides a representing the same data are such that eight bits provides a
quite acceptable range of values. Even where there might be more quite acceptable range of values. Even where there might be more
than 256 such file system instances, having more than 256 distinct than 256 such file system instances, having more than 256 distinct
classes or priorities is unlikely. classes or priorities is unlikely.
o XDR does not have any means to declare an 8-bit data type, other
than an ASCII string, and using 32-bit data types would lead to
significant space inefficiency.
o Explicit definition of the various specific data items within XDR o Explicit definition of the various specific data items within XDR
would limit expandability in that any extension within a would limit expandability in that any extension within a
subsequent minor version would require yet another attribute, subsequent minor version would require yet another attribute,
leading to specification and implementation clumsiness. leading to specification and implementation clumsiness.
o Such explicit definitions would also make it impossible to propose o Such explicit definitions would also make it impossible to propose
standards-track extensions apart from a full minor version. standards-track extensions apart from a full minor version.
Each 8-bit successive field within this array is designated by a This encoding scheme can be adapted to the specification of multi-
constant byte-index as defined above. More significant bit fields byte numeric values, even though none are currently defined. If
within a single word have successive indices with a transition to the extensions are made via standards-track RFC's, multi-byte quantities
next word following the most significant 8-bit field in each word. will be encoded as a range of bytes with a range of indices with the
bytes interpreted in network byte order.
The set of info data is subject to expansion in a future minor The set of info data is subject to expansion in a future minor
version, or in a standards-track RFC, within the context of a single version, or in a standards-track RFC, within the context of a single
minor version. The server SHOULD NOT send and the client MUST NOT minor version. The server SHOULD NOT send and the client MUST NOT
use indices within the info array that are not defined in standards- use indices within the info array that are not defined in standards-
track RFC's. track RFC's.
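For the opaque form of the info array in the right-hand (-08) column, a lookup by byte index could look like the sketch below. The function names and the bool-plus-out-parameter style are assumptions made for illustration. No multi-byte items are currently defined, so infoMultiByte only shows how a future network-byte-order item might be read.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Byte indices and a flag bit taken from the -08 constants.
constexpr std::size_t LIBX_GFLAGS   = 0;
constexpr std::size_t LIBX_TFLAGS   = 1;
constexpr std::size_t LIBX_READRANK = 7;
constexpr uint8_t     LIGF_WRITABLE = 0x01;

// Hypothetical accessor: stores the byte at `index` in `out` and returns
// true, or returns false when the server sent a shorter array.
bool infoByte(const std::vector<uint8_t>& info, std::size_t index,
              uint8_t& out) {
    if (index >= info.size()) {
        return false;
    }
    out = info[index];
    return true;
}

// Hypothetical reader for a future multi-byte item: `count` bytes (1..4)
// starting at index `first`, interpreted in network byte order.
bool infoMultiByte(const std::vector<uint8_t>& info, std::size_t first,
                   std::size_t count, uint32_t& out) {
    if (count == 0 || count > 4 || first > info.size() ||
        count > info.size() - first) {
        return false;
    }
    uint32_t value = 0;
    for (std::size_t i = 0; i < count; ++i) {
        value = (value << 8) | info[first + i];  // most significant first
    }
    out = value;
    return true;
}
```

Reporting a missing byte as "absent" rather than as zero lets the caller decide how to treat indices the server did not send.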
The following fragment of C++ code (with Doxygen-style comments)
illustrates how data items within the info array can be found using a
byte-index such as specified by the constants beginning with "LIBX_".
The associated InfoArray object is assumed to be initialized with
"Length" containing the XDR-specified length in terms of 32-bit words
and "Data" containing the array of words encoded by the "info<>"
specification.
class InfoArray {
private:
uint32_t Length;
uint32_t Data[];
public:
uint8_t GetValue(int byteIndex);
};
/// @brief Get the value of a locations4_server info value
///
/// This method obtains the specific info value given a
/// byte index defined in the NFSv4.1 spec or another
/// later standards-track document.
///
/// @param[in] byteIndex The byte index identifying the
/// item requested.
/// @returns The value of the requested item.
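uint8_t InfoArray::GetValue(int byteIndex) {
// Negative indices are invalid; treat them like out-of-range ones.
if (byteIndex < 0) {
return (0);
}
uint32_t wordIndex = static_cast<uint32_t>(byteIndex) / 4;
uint32_t byteWithinWord = static_cast<uint32_t>(byteIndex) % 4;
if (wordIndex >= Length) {
return (0);
}
uint32_t ourWord = Data[wordIndex];
// Index 0 within a word selects the least significant byte.
return ((ourWord >> (byteWithinWord*8)) & 0xff);
}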
The info array contains within it: The info array contains within it:
o Two 8-bit flag fields, one devoted to general file-system o Two 8-bit flag fields, one devoted to general file-system
characteristics and a second reserved for transport-related characteristics and a second reserved for transport-related
capabilities. capabilities.
o Seven 8-bit class values which define various file system o Four 8-bit class values which define various file system
equivalence classes as explained below. equivalence classes as explained below.
o Four 8-bit priority values which govern file system selection as o Four 8-bit priority values which govern file system selection as
explained below. explained below.
The general file system characteristics flag (at byte index The general file system characteristics flag (at byte index
LIBX_GFLAGS) has the following bits defined within it: LIBX_GFLAGS) has the following bits defined within it:
o LIGF_WRITABLE indicates that this fs target is writable, allowing o LIGF_WRITABLE indicates that this fs target is writable, allowing
it to be selected by clients which may need to write on this file it to be selected by clients which may need to write on this file
skipping to change at page 193, line 46 skipping to change at page 200, line 39
not be used further. The client, if using it, should make an not be used further. The client, if using it, should make an
orderly transfer to another file system instance as expeditiously orderly transfer to another file system instance as expeditiously
as possible. It is expected that file systems going out of as possible. It is expected that file systems going out of
service will be announced as LIGF_GOING some time before the service will be announced as LIGF_GOING some time before the
actual loss of service and that the valid_for value will be actual loss of service and that the valid_for value will be
sufficiently small to allow clients to detect and act on scheduled sufficiently small to allow clients to detect and act on scheduled
events while large enough that the cost of the requests to fetch events while large enough that the cost of the requests to fetch
the fs_locations_info values will not be excessive. Values on the the fs_locations_info values will not be excessive. Values on the
order of ten minutes seem reasonable. order of ten minutes seem reasonable.
o LIGF_SPLIT indicates that when a transition occurs from the
current filesystem instance to this one, the replacement may
consist of multiple filesystems. In this case, the client has to
be prepared for the possibility that objects on the same fs before
migration will be on different ones after. Note that LIGF_SPLIT
is not incompatible with the filesystems belonging to the same
_fileid_ class since, if one has a set of fileid's that are unique
within an fs, each subset assigned to a smaller fs after migration
would not have any conflicts internal to that fs.
A client, in the case of a split filesystem will interrogate
existing files with which it has continuing connection (it is free
simply to forget cached filehandles). If the client remembers the
directory filehandle associated with each open file, it may
proceed upward using LOOKUPP to find the new fs boundaries.
Once the client recognizes that one filesystem has been split into
two, it could maintain applications running without disruption by
presenting the two filesystems as a single one until a convenient
point to recognize the transition, such as a reboot. This would
require a mapping of fsid's from the server's fsid's to fsid's as
seen by the client but this is already necessary for other reasons
anyway. As noted above, existing fileids within the two
descendant fs's will not conflict. Creation of new files in the
two descendant fs's may require some amount of fileid mapping
which can be performed very simply in many important cases.
The transport-flag field (at byte index LIBX_TFLAGS) contains the The transport-flag field (at byte index LIBX_TFLAGS) contains the
following bits related to the transport capabilities of the specific following bits related to the transport capabilities of the specific
file system. file system.
o LITF_RDMA indicates that this file system provides NFSv4.1 file o LITF_RDMA indicates that this file system provides NFSv4.1 file
system access using an RDMA-capable transport. system access using an RDMA-capable transport.
Attribute continuity and file system identity information are Attribute continuity and file system identity information are
expressed by defining equivalence relations on the sets of file expressed by defining equivalence relations on the sets of file
systems presented to the client. Each such relation is expressed as systems presented to the client. Each such relation is expressed as
a set of file system equivalence classes. For each relation, a file a set of file system equivalence classes. For each relation, a file
system has an 8-bit class number. Two file systems belong to the system has an 8-bit class number. Two file systems belong to the
same class if both have identical non-zero class numbers. Zero is same class if both have identical non-zero class numbers. Zero is
treated as non-matching. Most often, the relevant question for the treated as non-matching. Most often, the relevant question for the
client will be whether a given replica is identical-with/ client will be whether a given replica is identical-to/
continuous-to the current one in a given respect but the information continuous-with the current one in a given respect but the
should be available also as to whether two other replicas match in information should be available also as to whether two other replicas
that respect as well. match in that respect as well.
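The class-matching rule above (identical non-zero class numbers match, zero never matches) can be sketched as follows. The helper names are hypothetical, and treating a byte missing from a short info array as zero is an assumption of this sketch.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Class number at `classIndex`, or 0 (non-matching) when the array is
// too short. Treating a missing byte as 0 is an assumption.
uint8_t classNumber(const std::vector<uint8_t>& info,
                    std::size_t classIndex) {
    return classIndex < info.size() ? info[classIndex]
                                    : static_cast<uint8_t>(0);
}

// True when both file system instances carry the same non-zero class
// number for the relation selected by `classIndex` (a LIBX_CL* index).
bool sameClass(const std::vector<uint8_t>& a, const std::vector<uint8_t>& b,
               std::size_t classIndex) {
    const uint8_t ca = classNumber(a, classIndex);
    return ca != 0 && ca == classNumber(b, classIndex);
}
```

Because the function takes any two info arrays, it answers both questions raised in the text: whether a replica matches the current one, and whether two other replicas match each other.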
The following fields specify the file system's class numbers for the The following fields specify the file system's class numbers for the
equivalence relations used in determining the nature of file system equivalence relations used in determining the nature of file system
transitions. See Section 10.6 for details about how this information transitions. See Section 10.6 for details about how this information
is to be used. is to be used.
o The field with byte-index LIBX_CLSHARE defines the sharing class o The field with byte-index LIBX_CLSIMUL defines the simultaneous-
for the file system. use class for the file system.
o The field with byte-index LIBX_CLSERVER defines the server class
for the file system.
o The field with byte-index LIBX_CLENDPOINT defines the endpoint
class for the file system.
o The field with byte-index LIBX_CLHANDLE defines the handle class o The field with byte-index LIBX_CLHANDLE defines the handle class
for the file system. for the file system.
o The field with byte-index LIBX_CLFILEID defines the fileid class o The field with byte-index LIBX_CLFILEID defines the fileid class
for the file system. for the file system.
o The field with byte-index LIBX_CLVERIFIER defines the verifier o The field with byte-index LIBX_CLVERIFIER defines the verifier
class for the file system. class for the file system.
o The field with byte-index LIBX_CLSTATE defines the state class for o The field with byte-index LIBX_CLCHANGE defines the change class
the file system. for the file system.
Server-specified preference information is also provided via 8-bit Server-specified preference information is also provided via 8-bit
values within the info array. The values provide a rank and an order values within the info array. The values provide a rank and an order
(see below) to be used with separate values specifiable for the cases (see below) to be used with separate values specifiable for the cases
of read-only and writable file systems. These values are compared of read-only and writable file systems. These values are compared
for different file systems to establish the server-specified for different file systems to establish the server-specified
preference, with lower values indicating "more preferred". preference, with lower values indicating "more preferred".
Rank is used to express a strict server-imposed ordering on clients, Rank is used to express a strict server-imposed ordering on clients,
with lower values indicating "more preferred." Clients should with lower values indicating "more preferred." Clients should
skipping to change at page 195, line 35 skipping to change at page 203, line 5
o The field at byte index LIBX_WRITEORDER gives the order value to o The field at byte index LIBX_WRITEORDER gives the order value to
be used for writable access. be used for writable access.
Depending on the potential need for write access by a given client, Depending on the potential need for write access by a given client,
one of the pairs of rank and order values is used. The read rank and one of the pairs of rank and order values is used. The read rank and
order should only be used if the client knows that only reading will order should only be used if the client knows that only reading will
ever be done or if it is prepared to switch to a different replica in ever be done or if it is prepared to switch to a different replica in
the event that any write access capability is required in the future. the event that any write access capability is required in the future.
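One way a client might turn the rank and order values into a preference list is sketched below. The Candidate record and the function name are hypothetical. The sketch assumes that order is only consulted to break ties among entries of equal rank, and that a client which may need to write considers only instances with LIGF_WRITABLE set.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical client-side record built from one locations4_server entry.
struct Candidate {
    std::string server;
    bool writable;  // LIGF_WRITABLE bit from the general flags byte
    uint8_t readRank, readOrder, writeRank, writeOrder;
};

// Returns the candidates from most to least preferred. Lower rank wins;
// order is assumed to break ties within a rank.
std::vector<Candidate> preferenceOrder(std::vector<Candidate> list,
                                       bool mayWrite) {
    if (mayWrite) {
        // A client that may write cannot select read-only instances.
        list.erase(std::remove_if(list.begin(), list.end(),
                                  [](const Candidate& c) {
                                      return !c.writable;
                                  }),
                   list.end());
    }
    std::stable_sort(list.begin(), list.end(),
                     [mayWrite](const Candidate& a, const Candidate& b) {
                         const uint8_t ra = mayWrite ? a.writeRank : a.readRank;
                         const uint8_t rb = mayWrite ? b.writeRank : b.readRank;
                         if (ra != rb) {
                             return ra < rb;
                         }
                         const uint8_t oa = mayWrite ? a.writeOrder : a.readOrder;
                         const uint8_t ob = mayWrite ? b.writeOrder : b.readOrder;
                         return oa < ob;
                     });
    return list;
}
```

A stable sort keeps the server's original listing order for entries whose rank and order are both equal.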
10.10.2. The location4_info Structure
The locations4_info structure, encoding the fs_locations_info The locations4_info structure, encoding the fs_locations_info
attribute contains the following: attribute contains the following:
o The info_flags field which contains general flags that affect the
interpretation of this location4_info structure and all
location4_item structures within it. The only flag currently
defined is LIIF_VAR_SUB. All bits in the flag field which are not
defined should always be returned as zero.
o The fs_root field which contains the pathname of the root of the o The fs_root field which contains the pathname of the root of the
current file system on the current server, just as it does the current file system on the current server, just as it does the
fs_locations4 structure. fs_locations4 structure.
o An array of locations4_item structures, which contain information o An array of locations4_item structures, which contain information
about replicas of the current file system. Where the current file about replicas of the current file system. Where the current file
system is actually present, or has been present, i.e. this is not system is actually present, or has been present, i.e. this is not
a referral situation, one of the locations4_item structures will a referral situation, one of the locations4_item structures will
contain a locations4_server for the current server. This contain a locations4_server for the current server. This
structure will have LIGF_ABSENT set if the current file system is structure will have LIGF_ABSENT set if the current file system is
skipping to change at page 196, line 12 skipping to change at page 203, line 38
o The valid_for field specifies a time for which it is reasonable o The valid_for field specifies a time for which it is reasonable
for a client to use the fs_locations_info attribute without for a client to use the fs_locations_info attribute without
refetch. The valid_for value does not provide a guarantee of refetch. The valid_for value does not provide a guarantee of
validity since servers can unexpectedly go out of service or validity since servers can unexpectedly go out of service or
become inaccessible for any number of reasons. Clients are well- become inaccessible for any number of reasons. Clients are well-
advised to refetch this information for actively accessed file advised to refetch this information for actively accessed file
systems every valid_for seconds. This is particularly important systems every valid_for seconds. This is particularly important
when file system replicas may go out of service in a controlled when file system replicas may go out of service in a controlled
way using the LIGF_GOING flag to communicate an ongoing change. way using the LIGF_GOING flag to communicate an ongoing change.
The server should set valid_for to a value which allows well- The server should set valid_for to a value which allows well-
behaved clients to notice the LIF_GOING flag and make an orderly behaved clients to notice the LIGF_GOING flag and make an orderly
switch before the loss of service becomes effective. If this switch before the loss of service becomes effective. If this
value is zero, then no refetch interval is appropriate and the value is zero, then no refetch interval is appropriate and the
client need not refetch this data on any particular schedule. In client need not refetch this data on any particular schedule. In
the event of a transition to a new file system instance, a new the event of a transition to a new file system instance, a new
value of the fs_locations_info attribute will be fetched at the value of the fs_locations_info attribute will be fetched at the
destination and it is to be expected that this may have a destination and it is to be expected that this may have a
different valid_for value, which the client should then use, in different valid_for value, which the client should then use, in
the same fashion as the previous value. the same fashion as the previous value.
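The refetch rule for valid_for can be sketched as a simple check. The function name, the 32-bit type chosen for valid_for and the use of a seconds counter for timestamps are assumptions of this sketch.

```cpp
#include <cstdint>

// Returns true when the fs_locations_info attribute should be refetched.
//   validFor:  valid_for value in seconds (0 means no refetch schedule).
//   fetchedAt: time of the last fetch, in seconds on any monotonic clock.
//   now:       current time on the same clock.
bool shouldRefetch(uint32_t validFor, uint64_t fetchedAt, uint64_t now) {
    if (validFor == 0) {
        return false;  // no refetch interval is appropriate
    }
    if (now < fetchedAt) {
        return false;  // clock anomaly; wait for a sane reading
    }
    return now - fetchedAt >= validFor;
}
```

After a transition to a new file system instance, the client would restart this check with the valid_for value fetched at the destination.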
As noted above, the fs_locations_info attribute, when supported, may The LIIF_VAR_SUB flag within info_flags controls whether variable
be requested of absent file systems without causing NFS4ERR_MOVED to substitution is to be enabled.
be returned and it is generally expected that will be available for
both present and absent file systems even if only a single 10.10.3. The location4_item Structure
location_server entry is present, designating the current (present)
file system, or two location_server entries designating the current The location4_item structure contains a pathname (in the variable
(and now previous) location of an absent file system and its "rootpath") which encodes the path of the target filesystem replicas
successor location. Servers are strongly urged to support this on the set of servers designated by the included location4_server
attribute on all file systems if they support it on any file system. entries. The precise manner in which this target location is
specified depends on the value of the LIIF_VAR_SUB flag within the
associated location4_info structure.
If this flag is not set, then rootpath simply designates the location
of the target filesystem within each server's single-server namespace
just as it does for the rootpath within the fs_location structure.
When this bit is set, however, component entries of a certain form
are subject to client-specific variable substitution so as to allow a
degree of namespace non-uniformity in order to accommodate the
selection of client-specific filesystem targets to adapt to different
client architectures or other characteristics.
When such substitution is in effect, a variable beginning with the
string "${" and ending with the string "}" and containing a colon is
to be replaced by the client-specific value associated with that
variable. The string "unknown" should be used by the client when it
has no value for such a variable. The pathname resulting from such
substitutions is used to designate the target filesystem, so that
different clients may have different filesystems corresponding to
that location in the multi-server namespace.
As mentioned above, such substituted pathname variables contain a
colon. The part before the colon is to be a DNS domain name with the
part after being a case-insensitive alphanumeric string.
Where the domain is "ietf.org", only variable names defined in this
document or subsequent standards-track RFCs are subject to such
substitution. Organizations are free to use their domain names to
create their own sets of client-specific variables, to be subject to
such substitution. In cases where such variables are intended to be
used more broadly than a single organization, publication of an
informational RFC defining such variables is recommended.
The variable ${ietf.org:CPU_ARCH} is used to denote the CPU
architecture for which object files are compiled. This specification does not
limit the acceptable values (except that they must be valid UTF-8
strings) but such values as "x86", "x86_64" and "sparc" would be
expected to be used in line with industry practice.
The variable ${ietf.org:OS_TYPE} is used to denote the operating
system and thus the kernel and library APIs for which code might be
compiled. This specification does not limit the acceptable values
(except that they must be valid UTF-8 strings) but such values as
"linux" and "freebsd" would be expected to be used in line with
industry practice.
The variable ${ietf.org:OS_VERSION} is used to denote the operating
system version and thus the specific details of versioned
interfaces for which code might be compiled. This specification does
not limit the acceptable values (except that they must be valid UTF-8
strings) but combinations of numbers and letters with interspersed
dots would be expected to be used in line with industry practice,
with the details of the version format depending on the specific
value of the variable ${ietf.org:OS_TYPE} with which it
is used.
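The substitution rule described above can be sketched in Python as follows. This is an illustration only, not part of the protocol definition. The function name substitute_rootpath, the var_sub_enabled parameter (standing in for the LIIF_VAR_SUB flag), and the client_values table are hypothetical. The sketch also assumes that a variable may appear anywhere within a component, that underscores are permitted in the name after the colon (as in CPU_ARCH), and that the domain part is compared case-insensitively like any DNS name.

```python
import re

# Matches ${domain:name}. Allowing "_" in the name is an assumption made so
# that names such as CPU_ARCH match.
_VAR = re.compile(r"\$\{([^:{}]+):([A-Za-z0-9_]+)\}")


def substitute_rootpath(components, client_values, var_sub_enabled):
    """Return rootpath components with client-specific variables replaced.

    components      -- list of pathname components (strings)
    client_values   -- dict mapping (domain, name) to this client's value
    var_sub_enabled -- stands in for the LIIF_VAR_SUB flag
    """
    if not var_sub_enabled:
        # Flag not set: rootpath is used exactly as given.
        return list(components)

    # Variable names are case-insensitive, so normalize the lookup keys.
    table = {(d.lower(), n.lower()): v for (d, n), v in client_values.items()}

    def replace(match):
        key = (match.group(1).lower(), match.group(2).lower())
        # The client uses "unknown" when it has no value for a variable.
        return table.get(key, "unknown")

    return [_VAR.sub(replace, c) for c in components]


client_values = {
    ("ietf.org", "CPU_ARCH"): "x86_64",
    ("ietf.org", "OS_TYPE"): "linux",
}
path = ["export", "${ietf.org:CPU_ARCH}", "${ietf.org:os_type}",
        "${ietf.org:OS_VERSION}"]
print(substitute_rootpath(path, client_values, True))
# prints ['export', 'x86_64', 'linux', 'unknown']
```

With var_sub_enabled false, the same call returns the components unchanged, matching the case in which rootpath simply designates a location in each server's namespace.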
Use of these variables could result in direction of different clients
to different file systems on the same server, as appropriate to
particular clients. In cases in which the target filesystems are
located on different servers, a single server could serve as a
referral point so that each valid combination of variable values
would designate a referral hosted on a single server, with the
targets of those referrals on a number of different servers.
Although variable substitution is most suitable for use in the
context of referrals, it may be used in the context of replication
and migration. If it is used in these contexts, the server must
ensure that no matter what values the client presents for the
substituted variables, the result is always a valid successor file
system instance to that from which a transition is occurring, i.e.
that the data is identical or represents a later image of a writable
file system.
Note that when "rootpath" is a null pathname (that is, one with zero
components), the file system designated is at the root of the
specified server, whether the LIIF_VAR_SUB flag within the associated
location4_info structure is set or not.
10.11. The Attribute fs_status 10.11. The Attribute fs_status
In an environment in which multiple copies of the same basic set of In an environment in which multiple copies of the same basic set of
data are available, information regarding the particular source of data are available, information regarding the particular source of
such data and the relationships among different copies, can be very such data and the relationships among different copies, can be very
helpful in providing consistent data to applications. helpful in providing consistent data to applications.
enum fs4_status_type { enum fs4_status_type {
STATUS4_FIXED = 1, STATUS4_FIXED = 1,
STATUS4_UPDATED = 2, STATUS4_VERSIONED = 2,
STATUS4_VERSIONED = 3, STATUS4_UPDATED = 3,
STATUS4_WRITABLE = 4, STATUS4_WRITABLE = 4,
STATUS4_ABSENT = 5 STATUS4_ABSENT = 5
}; };
struct fs4_status { struct fs4_status {
fs4_status_type fsstat_type; fs4_status_type fsstat_type;
utf8str_cs fsstat_source; utf8str_cs fsstat_source;
utf8str_cs fsstat_current; utf8str_cs fsstat_current;
int32_t fsstat_age; int32_t fsstat_age;
nfstime4 fsstat_version; nfstime4 fsstat_version;
skipping to change at page 197, line 32 skipping to change at page 206, line 32
This is of particular importance when using the version values to This is of particular importance when using the version values to
determine appropriate succession of file system images. Five types determine appropriate succession of file system images. Five types
are distinguished: are distinguished:
o STATUS4_FIXED which indicates a read-only image in the sense that o STATUS4_FIXED which indicates a read-only image in the sense that
it will never change. The possibility is allowed that as a result it will never change. The possibility is allowed that as a result
of migration or switch to a different image, changed data can be of migration or switch to a different image, changed data can be
accessed but within the confines of this instance, no change is accessed but within the confines of this instance, no change is
allowed. The client can use this fact to aggressively cache. allowed. The client can use this fact to aggressively cache.
o STATUS4_VERSIONED which indicates that the image, like the
STATUS4_UPDATED case, is updated exogenously, but it provides a
guarantee that the server will carefully update an associated
version value so that the client can protect itself from a
situation in which it reads data from one version of the file
system, and then later reads data from an earlier version of the
same file system. See below for a discussion of how this can be
done.
o STATUS4_UPDATED which indicates an image that cannot be updated by o STATUS4_UPDATED which indicates an image that cannot be updated by
the user writing to it but may be changed exogenously, typically the user writing to it but may be changed exogenously, typically
because it is a periodically updated copy of another writable file because it is a periodically updated copy of another writable file
system somewhere else. system somewhere else. In this case, version information is not
provided and the client does not have the responsibility of making
o STATUS4_VERSIONED which indicates that the image, like the sure that this version only advances upon a file system instance
STATUS4_UPDATED case, is updated exogenously, but it provides a transition. In this case, it is the responsibility of the server
guarantee that the server will carefully update the associated to make sure that the data presented after a file system instance
version value so that the client, may if it chooses, protect transition is a proper successor image and includes all changes
itself from a situation in which it reads data from one version of seen by the client and any change made before all such changes.
the file system, and then later reads data from an earlier version
of the same file system. See below for a discussion of how this
can be done.
o STATUS4_WRITABLE which indicates that the file system is an actual o STATUS4_WRITABLE which indicates that the file system is an actual
writable one. The client need not of course actually write to the writable one. The client need not of course actually write to the
file system, but once it does, it should not accept a transition file system, but once it does, it should not accept a transition
to anything other than a writable instance of that same file to anything other than a writable instance of that same file
system. system.
o STATUS4_ABSENT which indicates that the information is the last o STATUS4_ABSENT which indicates that the information is the last
valid for a file system which is no longer present. valid for a file system which is no longer present.
skipping to change at page 203, line 35 skipping to change at page 212, line 37
11.5. Directory Delegation Recovery 11.5. Directory Delegation Recovery
Crash recovery has two main goals, avoiding the necessity of breaking Crash recovery has two main goals, avoiding the necessity of breaking
application guarantees with respect to locked files and delivery of application guarantees with respect to locked files and delivery of
updates cached at the client. Neither of these applies to updates cached at the client. Neither of these applies to
directories protected by read delegations and notifications. Thus, directories protected by read delegations and notifications. Thus,
the client is required to establish a new delegation on a server or the client is required to establish a new delegation on a server or
client reboot. [[Comment.14: we have special reclaim types allow client reboot. [[Comment.14: we have special reclaim types allow
clients to recovery delegations through client reboot. Do we really clients to recovery delegations through client reboot. Do we really
want CREATE_CLIENTID/CREATE_SESSION to destroy directory delegation want EXCHANGE_ID/CREATE_SESSION to destroy directory delegation
state?]] state?]]
12. Parallel NFS (pNFS) 12. Parallel NFS (pNFS)
12.1. Introduction 12.1. Introduction
The NFSv4.0 protocol [2] specifies the interaction between a client The NFSv4.0 protocol [2] specifies the interaction between a client
that accesses files and a server that provides access to files and is that accesses files and a server that provides access to files and is
responsible for coordinating access by multiple clients. As responsible for coordinating access by multiple clients. As
described in the pNFS problem statement, this requires that all described in the pNFS problem statement, this requires that all
skipping to change at page 204, line 31 skipping to change at page 213, line 33
||| | ||| |
||| | ||| |
||| Storage +-----------+ | ||| Storage +-----------+ |
||| Protocol |+-----------+ | ||| Protocol |+-----------+ |
||+----------------||+-----------+ Control| ||+----------------||+-----------+ Control|
|+-----------------||| | Protocol| |+-----------------||| | Protocol|
+------------------+|| Storage |------------+ +------------------+|| Storage |------------+
+| Devices | +| Devices |
+-----------+ +-----------+
Figure 60 Figure 61
In this structure, the responsibility for coordination of file access In this structure, the responsibility for coordination of file access
by multiple clients is shared among the server, clients, and storage by multiple clients is shared among the server, clients, and storage
devices. This is in contrast to NFSv4 without pNFS extensions, in devices. This is in contrast to NFSv4 without pNFS extensions, in
which this is primarily the server's responsibility, some of which which this is primarily the server's responsibility, some of which
can be delegated to clients under strictly specified conditions. can be delegated to clients under strictly specified conditions.
The pNFS extension to NFSv4 takes the form of new operations that The pNFS extension to NFSv4 takes the form of new operations that
manage data location information called a "layout". The layout is manage data location information called a "layout". The layout is
managed in a similar fashion as NFSv4 data delegations (e.g., they managed in a similar fashion as NFSv4 data delegations (e.g., they
skipping to change at page 206, line 24 skipping to change at page 215, line 25
is implemented by the extended (p)NFSv4 server. When the file system is implemented by the extended (p)NFSv4 server. When the file system
being exported by (p)NFSv4 uses storage devices that are visible to being exported by (p)NFSv4 uses storage devices that are visible to
clients over the network, the data path may be implemented by direct clients over the network, the data path may be implemented by direct
communication between the extended (p)NFSv4 file system client and communication between the extended (p)NFSv4 file system client and
the storage devices. This leads to a few new terms used to describe the storage devices. This leads to a few new terms used to describe
the protocol extension and some clarifications of existing terms. the protocol extension and some clarifications of existing terms.
12.2.1. Metadata Server 12.2.1. Metadata Server
A pNFS "server" or "metadata server" is a server as defined by A pNFS "server" or "metadata server" is a server as defined by
RFC3530 RFC3530 [2], which additionally provides support of the pNFS RFC3530 [2], which additionally provides support of the pNFS minor
minor extension. When using the pNFS NFSv4 minor extension, the extension. When using the pNFS NFSv4 minor extension, the metadata
metadata server may hold only the metadata associated with a file, server may hold only the metadata associated with a file, while the
while the data can be stored on the storage devices. However, data can be stored on the storage devices. However, similar to
similar to NFSv4, data may also be written through the metadata NFSv4, data may also be written through the metadata server. Note:
server. Note: directory data is always accessed through the metadata directory data is always accessed through the metadata server.
server.
12.2.2. Client 12.2.2. Client
A pNFS "client" is a client as defined by RFC3530 [2], with the A pNFS "client" is a client as defined by RFC3530 [2], with the
addition of supporting the pNFS minor extension server protocol and addition of supporting the pNFS minor extension server protocol and
with the addition of supporting at least one storage protocol for with the addition of supporting at least one storage protocol for
performing I/O directly to storage devices. performing I/O directly to storage devices.
12.2.3. Storage Device 12.2.3. Storage Device
skipping to change at page 226, line 13 skipping to change at page 235, line 13
the consequences of I/Os already in flight. the consequences of I/Os already in flight.
The issue of the effects of I/Os started before lease expiration and The issue of the effects of I/Os started before lease expiration and
possibly continuing through lease expiration is the responsibility of possibly continuing through lease expiration is the responsibility of
the data storage protocol and as such is layout type specific. There the data storage protocol and as such is layout type specific. There
are two approaches the data storage protocol can take. The protocol are two approaches the data storage protocol can take. The protocol
may adopt a global solution which prevents all I/Os from being may adopt a global solution which prevents all I/Os from being
executed after the lease expiration and thus is safe against a client executed after the lease expiration and thus is safe against a client
who issues I/Os after lease expiration. This is the preferred who issues I/Os after lease expiration. This is the preferred
solution and the solution used by NFSv4 file based layouts (see solution and the solution used by NFSv4 file based layouts (see
Section 12.4.6); as well, the object storage device protocol allows Section 12.4.7); as well, the object storage device protocol allows
storage to fence clients after lease expiration. Alternatively, the storage to fence clients after lease expiration. Alternatively, the
storage protocol may rely on proper client operation and only deal storage protocol may rely on proper client operation and only deal
with the effects of lingering I/Os. These solutions may impact the with the effects of lingering I/Os. These solutions may impact the
client layout-driver, the metadata server layout-driver, and the client layout-driver, the metadata server layout-driver, and the
control protocol. control protocol.
12.3.7.4. Storage Device Recovery 12.3.7.4. Storage Device Recovery
Storage device crash recovery is mostly dependent upon the layout Storage device crash recovery is mostly dependent upon the layout
type in use. However, there are a few general techniques a client type in use. However, there are a few general techniques a client
skipping to change at page 227, line 26 skipping to change at page 236, line 26
the original WRITE finally succeeds, the same issues can occur. the original WRITE finally succeeds, the same issues can occur.
However, this is solved by sessions in NFSv4.x. However, this is solved by sessions in NFSv4.x.
12.3.8. Security Considerations 12.3.8. Security Considerations
The pNFS extension partitions the NFSv4 file system protocol into two The pNFS extension partitions the NFSv4 file system protocol into two
parts, the control path and the data path (i.e., storage protocol). parts, the control path and the data path (i.e., storage protocol).
The control path contains all the new operations described by this The control path contains all the new operations described by this
extension; all existing NFSv4 security mechanisms and features apply extension; all existing NFSv4 security mechanisms and features apply
to the control path. The combination of components in a pNFS system to the control path. The combination of components in a pNFS system
(see Figure 60) is required to preserve the security properties of (see Figure 61) is required to preserve the security properties of
NFSv4 with respect to an entity accessing data via a client, NFSv4 with respect to an entity accessing data via a client,
including security countermeasures to defend against threats that including security countermeasures to defend against threats that
NFSv4 provides defenses for in environments where these threats are NFSv4 provides defenses for in environments where these threats are
considered significant. considered significant.
In some cases, the security countermeasures for connections to In some cases, the security countermeasures for connections to
storage devices may take the form of physical isolation or a storage devices may take the form of physical isolation or a
recommendation not to use pNFS in an environment. For example, it is recommendation not to use pNFS in an environment. For example, it is
currently infeasible to provide confidentiality protection for some currently infeasible to provide confidentiality protection for some
storage device access protocols to protect against eavesdropping; in storage device access protocols to protect against eavesdropping; in
skipping to change at page 228, line 15 skipping to change at page 237, line 15
access respects NFSv4 ACLs and file open modes. This entails access respects NFSv4 ACLs and file open modes. This entails
performing both of these checks on every access in the client, the performing both of these checks on every access in the client, the
storage device, or both. If a pNFS configuration performs these storage device, or both. If a pNFS configuration performs these
checks only in the client, the risk of a misbehaving client obtaining checks only in the client, the risk of a misbehaving client obtaining
unauthorized access is an important consideration in determining when unauthorized access is an important consideration in determining when
it is appropriate to use such a pNFS configuration. Such it is appropriate to use such a pNFS configuration. Such
configurations SHOULD NOT be used when client- only access checks do configurations SHOULD NOT be used when client- only access checks do
not provide sufficient assurance that NFSv4 access control is being not provide sufficient assurance that NFSv4 access control is being
applied correctly. applied correctly.
12.4. The NFSv4 File Layout Type 12.4. The NFSv4.1 File Layout Type
This section describes the semantics and format of NFSv4 file-based This section describes the semantics and format of NFSv4.1 file-based
layouts. layouts.
12.4.1. File Striping and Data Access 12.4.1. Session Considerations
Sessions are a mandatory feature of NFSv4.1, and this extends to both
the metadata server and file-based data servers. If data is served
by both the metadata server and a data server, the metadata and data
server:
o MUST share the same clientid.
o MUST have separate sessions (unless the metadata server and data
server are the same entity). Both sessions MUST be associated
with the same clientid.
It is legal for a data server to also act as a metadata server. A
server serving both roles will provide service for one set of file
systems in one role, and another, possibly intersecting, possibly
disjoint set of filesystems in the other role. A client using a
server serving both roles is free to use the same clientid and
sessionid when interacting with either of the server's roles.
12.4.2. File Striping and Data Access
The file layout type describes a method for striping data across The file layout type describes a method for striping data across
multiple devices. The data for each stripe unit is stored within an multiple devices. The data for each stripe unit is stored within an
NFSv4 file located on a particular storage device. NFSv4.1 file located on a particular storage device.
Before discussing the file layout, it is necessary to describe the Before discussing the file layout, it is necessary to describe the
file layout device type; the structures are as follows: file layout device type; the structures are as follows:
typedef netaddr4 nfsv4_file_layout_simple_device4; typedef netaddr4 nfsv4_file_layout_simple_device4;
enum file_layout_device_type { enum file_layout_device_type {
FILE_SIMPLE = 1, FILE_SIMPLE = 1,
FILE_COMPLEX = 2 FILE_COMPLEX = 2
}; };
skipping to change at page 228, line 50 skipping to change at page 238, line 26
case FILE_COMPLEX: case FILE_COMPLEX:
deviceid4 dev_list<>; deviceid4 dev_list<>;
default: default:
void; void;
}; };
The "nfsv4_file_layout_device4" structure is a union composed of a The "nfsv4_file_layout_device4" structure is a union composed of a
SIMPLE or a COMPLEX device type. A Simple device is composed of an SIMPLE or a COMPLEX device type. A Simple device is composed of an
array of nfsv4_file_layout_simple_device4 structures. All devices array of nfsv4_file_layout_simple_device4 structures. All devices
identified by a Simple device must be 'equivalent' and are used for identified by a Simple device must be 'equivalent' and are used for
device multipathing; see Section 12.4.1.3 for more details on device multipathing; see Section 12.4.2.3 for more details on
equivalent devices. Simple devices always refer to actual physical equivalent devices. Simple devices always refer to actual physical
devices. On the other hand, a Complex device is a virtual device devices. On the other hand, a Complex device is a virtual device
that is constructed of multiple Simple devices. Each device within that is constructed of multiple Simple devices. Each device within
the Complex device list is identified by its device ID. A Complex the Complex device list is identified by its device ID. A Complex
device MUST NOT reference other Complex devices; only Simple devices device MUST NOT reference other Complex devices; only Simple devices
are to be referenced. This enables multiple physical devices to be are to be referenced. This enables multiple physical devices to be
identified through a single device ID and provides a space efficient identified through a single device ID and provides a space efficient
mechanism by which to identify multiple devices within a layout. mechanism by which to identify multiple devices within a layout.
Complex devices can be thought of as a table of devices. Complex and Complex devices can be thought of as a table of devices. Complex and
Simple devices share the same device ID space and should be cached Simple devices share the same device ID space and should be cached
skipping to change at page 231, line 43 skipping to change at page 241, line 31
data file on a storage device. It allows for two different data data file on a storage device. It allows for two different data
layouts: sparse and dense or packed. The stripe type determines the layouts: sparse and dense or packed. The stripe type determines the
calculation that must be made to map the client visible file offset calculation that must be made to map the client visible file offset
to the offset within the data file located on the storage device. to the offset within the data file located on the storage device.
The layout hint structure is described in more detail in Section The layout hint structure is described in more detail in Section
4.15. It is used, by the client, as by the FILE_LAYOUT_HINT 4.15. It is used, by the client, as by the FILE_LAYOUT_HINT
attribute to specify the type of layout to be used for a newly attribute to specify the type of layout to be used for a newly
created file. created file.
12.4.1.1. Sparse and Dense Storage Device Data Layouts 12.4.2.1. Sparse and Dense Storage Device Data Layouts
The stripe_type field allows for two storage device data file The stripe_type field allows for two storage device data file
representations. Example sparse and dense storage device data representations. Example sparse and dense storage device data
layouts are illustrated below: layouts are illustrated below:
Sparse file-layout (stripe_unit = 4KB) Sparse file-layout (stripe_unit = 4KB)
------------------ ------------------
Is represented by the following file layout on the storage devices: Is represented by the following file layout on the storage devices:
skipping to change at page 233, line 19 skipping to change at page 243, line 4
dense storage device layouts is: dense storage device layouts is:
stripe_width = stripe_unit * N; where N = |dev_list| stripe_width = stripe_unit * N; where N = |dev_list|
dev_offset = floor(file_offset / stripe_width) * stripe_unit + dev_offset = floor(file_offset / stripe_width) * stripe_unit +
file_offset % stripe_unit file_offset % stripe_unit
Regardless of the storage device data file layout, the calculation to Regardless of the storage device data file layout, the calculation to
determine the index into the device array is the same: determine the index into the device array is the same:
dev_idx = floor(file_offset / stripe_unit) mod N dev_idx = floor(file_offset / stripe_unit) mod N
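As a worked illustration of the two formulas above, the following Python sketch computes the dense-layout device offset and the device index. The function names and the sample values are chosen here for illustration and are not taken from the draft; offsets are in bytes.

```python
def dense_dev_offset(file_offset, stripe_unit, n_devices):
    """Offset within the data file on the storage device (dense layout)."""
    stripe_width = stripe_unit * n_devices  # N = |dev_list|
    return (file_offset // stripe_width) * stripe_unit + file_offset % stripe_unit


def dev_index(file_offset, stripe_unit, n_devices):
    """Index into the device array; the same for sparse and dense layouts."""
    return (file_offset // stripe_unit) % n_devices


# Example: 4KB stripe unit, three devices, client-visible offset 20000.
print(dev_index(20000, 4096, 3))         # prints 1
print(dense_dev_offset(20000, 4096, 3))  # prints 7712
```

In this example, offset 20000 falls in the fifth stripe unit (index 4), which maps to device 1. That is the second stripe unit stored on the device, so the dense offset is 4096 plus the 3616 bytes into the unit.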
Section 12.4.6 describe the semantics for dealing with reads to holes
Section 12.4.5 describe the semantics for dealing with reads to holes
within the striped file. This is of particular concern, since each within the striped file. This is of particular concern, since each
individual component stripe file (i.e., the component of the striped individual component stripe file (i.e., the component of the striped
file that lives on a particular storage device) may be of different file that lives on a particular storage device) may be of different
length. Thus, clients may experience 'short' reads when reading off length. Thus, clients may experience 'short' reads when reading off
the end of one of these component files. the end of one of these component files.
12.4.1.2. Metadata and Storage Device Roles 12.4.2.2. Metadata and Storage Device Roles
In many cases, the metadata server and the storage device will be In many cases, the metadata server and the storage device will be
separate pieces of physical hardware. The specification text is separate pieces of physical hardware. The specification text is
written as if that were always the case. However, it can be the case written as if that were always the case. However, it can be the case
that the same physical hardware is used to implement both a metadata that the same physical hardware is used to implement both a metadata
and storage device and in this case, the specification text's and storage device and in this case, the specification text's
references to these two entities are to be understood as referring to references to these two entities are to be understood as referring to
the same physical hardware implementing two distinct roles and it is the same physical hardware implementing two distinct roles and it is
important that it be clearly understood on behalf of which role the important that it be clearly understood on behalf of which role the
hardware is executing at any given time. hardware is executing at any given time.
skipping to change at page 234, line 31 skipping to change at page 244, line 14
If a current filehandle is set that is inconsistent with the role to If a current filehandle is set that is inconsistent with the role to
which it is directed, then the error NFS4ERR_BADHANDLE should result. which it is directed, then the error NFS4ERR_BADHANDLE should result.
For example, if a request is directed at the storage device, because For example, if a request is directed at the storage device, because
the first current handle is from a layout, any attempt to set the the first current handle is from a layout, any attempt to set the
current filehandle to be a value not from a layout should be current filehandle to be a value not from a layout should be
rejected. Similarly, if the first current file handle was for a rejected. Similarly, if the first current file handle was for a
value not from a layout, a subsequent attempt to set the current value not from a layout, a subsequent attempt to set the current
filehandle to a value obtained from a layout should be rejected. filehandle to a value obtained from a layout should be rejected.
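The consistency rule above can be sketched as a simple check over the filehandles set within one request. This is a minimal illustration: the function name and the representation of each filehandle as a boolean ("obtained from a layout" or not) are assumptions made here, and a real server would derive that property from the filehandle itself.

```python
NFS4_OK = "NFS4_OK"
NFS4ERR_BADHANDLE = "NFS4ERR_BADHANDLE"


def check_filehandle_roles(from_layout_flags):
    """Check that every filehandle set in a request matches the first one.

    from_layout_flags -- one boolean per filehandle set in the request,
                         True if that filehandle was obtained from a layout.
    """
    if not from_layout_flags:
        return NFS4_OK
    # The first current filehandle fixes the role for the whole request.
    first = from_layout_flags[0]
    for flag in from_layout_flags[1:]:
        if flag != first:
            return NFS4ERR_BADHANDLE
    return NFS4_OK


print(check_filehandle_roles([True, True]))   # prints NFS4_OK
print(check_filehandle_roles([True, False]))  # prints NFS4ERR_BADHANDLE
```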
12.4.1.3. Device Multipathing 12.4.2.3. Device Multipathing
The NFSv4 file layout supports multipathing to 'equivalent' devices. The NFSv4.1 file layout supports multipathing to 'equivalent'
Device-level multipathing is primarily of use in the case of a data devices. Device-level multipathing is primarily of use in the case
server failure --- it allows the client to switch to another storage of a data server failure --- it allows the client to switch to
device that is exporting the same data stripe, without having to another storage device that is exporting the same data stripe,
contact the metadata server for a new layout. without having to contact the metadata server for a new layout.
To support device multipathing, an array of device IDs is encoded To support device multipathing, an array of device IDs is encoded
within the SIMPLE case of the nfsv4_file_layout_device4 union. This within the SIMPLE case of the nfsv4_file_layout_device4 union. This
array represents an ordered list of devices where the first element array represents an ordered list of devices where the first element
has the highest priority. Each device in the list MUST be has the highest priority. Each device in the list MUST be
'equivalent' to every other device in the list and each device must 'equivalent' to every other device in the list and each device must
be attempted in the order specified. be attempted in the order specified.
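The ordered-attempt rule can be sketched as a loop over the equivalent devices of a Simple device. The function name, the do_io callback, and the choice of ConnectionError as the failure that triggers a switch are all assumptions made for this illustration; the draft does not yet specify which errors should cause a client to try the next equivalent device.

```python
def io_with_multipathing(devices, do_io):
    """Attempt an I/O against equivalent devices in priority order.

    devices -- ordered list of equivalent device addresses, highest
               priority first
    do_io   -- hypothetical callable performing the I/O against one device
    """
    if not devices:
        raise ValueError("empty device list")
    last_error = None
    for device in devices:
        try:
            return do_io(device)
        except ConnectionError as err:
            # Assumed failover condition: the device is unreachable.
            last_error = err
    # Every equivalent device failed; report the last failure.
    raise last_error


def example_io(device):
    # Stand-in for a READ; pretend the first data server is down.
    if device == "ds1.example.com":
        raise ConnectionError("ds1 unreachable")
    return "data from " + device


print(io_with_multipathing(["ds1.example.com", "ds2.example.com"], example_io))
# prints data from ds2.example.com
```

Because equivalent devices export the same data, the client can switch in this way without contacting the metadata server for a new layout.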
Equivalent devices MUST export the same system image (e.g., the Equivalent devices MUST export the same system image (e.g., the
stateids and filehandles that they use are the same) and must provide stateids and filehandles that they use are the same) and must provide
skipping to change at page 235, line 10 skipping to change at page 244, line 42
also have sufficient connections to the storage, such that writing to also have sufficient connections to the storage, such that writing to
one storage device is equivalent to writing to another; this also one storage device is equivalent to writing to another; this also
applies to reading. Also, if multiple copies of the same data exist, applies to reading. Also, if multiple copies of the same data exist,
reading from one must provide access to all existing copies. As reading from one must provide access to all existing copies. As
such, it is unlikely that multipathing will provide additional such, it is unlikely that multipathing will provide additional
benefit in the case of an I/O error. benefit in the case of an I/O error.
[NOTE: the error cases in which a client is expected to attempt an [NOTE: the error cases in which a client is expected to attempt an
equivalent storage device should be specified.] equivalent storage device should be specified.]
12.4.1.4. Operations Issued to Storage Devices 12.4.2.4. Operations Issued to Storage Devices
Clients MUST use the filehandle described within the layout when Clients MUST use the filehandle described within the layout when
accessing data on the storage devices. When using the layout's accessing data on the storage devices. When using the layout's
filehandle, the client MUST only issue BIND_BACKCHANNEL, filehandle, the client MUST only issue the NULL procedure and the
BIND_CONN_TO_SESSION, CREATE_SESSION, COMMIT, DESTROY_SESSION, NULL, BACKCHANNEL_CTL, BIND_CONN_TO_SESSION, CREATE_SESSION, COMMIT,
READ, WRITE, PUTFH, SECINFO_NO_NAME, SET_SSV, and SEQUENCE operations DESTROY_SESSION, READ, WRITE, PUTFH, SECINFO_NO_NAME, SET_SSV, and
to the storage device associated with that filehandle. If a client SEQUENCE operations to the storage device associated with that
issues an operation other than those specified above, using the filehandle. If a client issues an operation other than those
filehandle and storage device listed in the client's layout, that specified above, using the filehandle and storage device listed in
storage device SHOULD return an error to the client. The client MUST the client's layout, that storage device SHOULD return an error to
follow the instruction implied by the layout (i.e., which filehandles the client. The client MUST follow the instruction implied by the
to use on which devices). As described in Section 12.3.2, a client layout (i.e., which filehandles to use on which devices). As
MUST NOT issue I/Os to storage devices for which it does not hold a described in Section 12.3.2, a client MUST NOT issue I/Os to storage
valid layout. The storage devices may reject such requests. devices for which it does not hold a valid layout. The storage
devices may reject such requests.
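The permitted-operation lists in the two revisions can be compared with a short sketch. The set names and the disallowed_ops helper are illustrative assumptions, not part of either draft. NULL is kept in both sets for simplicity even though -08 describes it as a procedure rather than an operation.

```python
# Operations a client may issue to a storage device using the layout's
# filehandle, as listed in the -07 text.
DS_OPS_07 = frozenset({
    "BIND_BACKCHANNEL", "BIND_CONN_TO_SESSION", "CREATE_SESSION", "COMMIT",
    "DESTROY_SESSION", "NULL", "READ", "WRITE", "PUTFH", "SECINFO_NO_NAME",
    "SET_SSV", "SEQUENCE",
})

# The -08 text replaces BIND_BACKCHANNEL with BACKCHANNEL_CTL.
DS_OPS_08 = (DS_OPS_07 - {"BIND_BACKCHANNEL"}) | {"BACKCHANNEL_CTL"}


def disallowed_ops(ops, allowed=DS_OPS_08):
    """Return the operations in `ops` that are outside the permitted set,
    preserving their order. A storage device SHOULD return an error for
    any of these."""
    return [op for op in ops if op not in allowed]
```

For example, disallowed_ops(["SEQUENCE", "PUTFH", "GETATTR"]) flags only GETATTR, which matches the requirement that GETATTR be directed to the metadata server.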
GETATTR and SETATTR MUST be directed to the metadata server. In the GETATTR and SETATTR MUST be directed to the metadata server. In the
case of a SETATTR of the size attribute, the control protocol is case of a SETATTR of the size attribute, the control protocol is
responsible for propagating size updates/truncations to the storage responsible for propagating size updates/truncations to the storage
devices. In the case of extending WRITEs to the storage devices, the devices. In the case of extending WRITEs to the storage devices, the
new size must be visible on the metadata server once a LAYOUTCOMMIT new size must be visible on the metadata server once a LAYOUTCOMMIT
has completed (see Section 12.3.4.2). Section 12.4.5 describes the has completed (see Section 12.3.4.2). Section 12.4.6 describes the
mechanism by which the client is to handle storage device files that mechanism by which the client is to handle storage device files that
do not reflect the metadata server's size. do not reflect the metadata server's size.
12.4.1.5. COMMIT through metadata server 12.4.2.5. COMMIT through metadata server
commit_through_mds in the file layout gives the metadata server a commit_through_mds in the file layout gives the metadata server a
preferred way of performing COMMIT. If this flag is true, the client preferred way of performing COMMIT. If this flag is true, the client
SHOULD send COMMIT to the metadata server instead of sending it to SHOULD send COMMIT to the metadata server instead of sending it to
the same data server to which the associated WRITEs were sent. In the same data server to which the associated WRITEs were sent. In
order to maintain the current NFSv4 commit and recovery model, all order to maintain the current NFSv4.1 commit and recovery model, all
the data servers MUST return a common verifier for all WRITEs in a the data servers MUST return a common verifier for all WRITEs in a
given file layout. The value of the write verifier MUST be changed given file layout. The value of the write verifier MUST be changed
at the metadata server or any data server that is referenced in the at the metadata server or any data server that is referenced in the
layout, whenever there is a server event that can possibly lead to layout, whenever there is a server event that can possibly lead to
loss of uncommitted data. The scope of the verifier can be for a loss of uncommitted data. The scope of the verifier can be for a
file or for the entire pNFS server. It might be more difficult for file or for the entire pNFS server. It might be more difficult for
the server to maintain the verifier at the file level but the benefit the server to maintain the verifier at the file level but the benefit
is that only events that impact a given file will require recovery is that only events that impact a given file will require recovery
action. action.
skipping to change at page 236, line 20 skipping to change at page 246, line 5
verifier as WRITE failure and try to recover by reissuing the WRITEs verifier as WRITE failure and try to recover by reissuing the WRITEs
to the original DS or using another path to that data if the layout has to the original DS or using another path to that data if the layout has
not been recalled. Another option for the client is to get a new not been recalled. Another option for the client is to get a new
layout or simply rewrite the data through the metadata server. If the layout or simply rewrite the data through the metadata server. If the
flag commit_through_mds is false the client should not send COMMIT to flag commit_through_mds is false the client should not send COMMIT to
the metadata server. Although it is valid to send COMMIT to the the metadata server. Although it is valid to send COMMIT to the
metadata server it should be used only to commit data that was metadata server it should be used only to commit data that was
written through the metadata server. See also section 14.7.4 written through the metadata server. See also section 14.7.4
"Storage Device Recover" for recovery options. "Storage Device Recover" for recovery options.
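The verifier comparison that drives this recovery can be sketched as below. This is a minimal illustration under assumptions: each uncommitted WRITE is modelled as a hypothetical (offset, length, verifier) tuple, and a verifier mismatch is treated as a WRITE failure, as the text above describes.

```python
from typing import List, Tuple

# Hypothetical record of an uncommitted WRITE: (offset, length, verifier).
PendingWrite = Tuple[int, int, bytes]


def writes_to_reissue(
    pending: List[PendingWrite],
    commit_verifier: bytes,
) -> List[Tuple[int, int]]:
    """Return the (offset, length) ranges whose WRITE verifier differs from
    the verifier returned by COMMIT. Those ranges may not have reached
    stable storage and are candidates for being written again."""
    return [
        (offset, length)
        for offset, length, verifier in pending
        if verifier != commit_verifier
    ]
```

Where the ranges are rewritten (the original data server, another path, or the metadata server) is left to the caller, since the text lists all three as options.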
12.4.2. Global Stateid Requirements 12.4.3. Global Stateid Requirements
Note, there are no stateids returned embedded within the layout. The Note, there are no stateids returned embedded within the layout. The
client MUST use the stateid representing open or lock state as client MUST use the stateid representing open or lock state as
returned by an earlier metadata operation (e.g., OPEN, LOCK), or a returned by an earlier metadata operation (e.g., OPEN, LOCK), or a
special stateid to perform I/O on the storage devices, as in regular special stateid to perform I/O on the storage devices, as in regular
NFSv4. Special stateid usage for I/O is subject to the NFSv4 NFSv4. Special stateid usage for I/O is subject to the NFSv4
protocol specification. The stateid used for I/O MUST have the same protocol specification. The stateid used for I/O MUST have the same
effect and be subject to the same validation on storage device as it effect and be subject to the same validation on storage device as it
would if the I/O was being performed on the metadata server itself in would if the I/O was being performed on the metadata server itself in
the absence of pNFS. This has the implication that stateids are the absence of pNFS. This has the implication that stateids are
globally valid on both the metadata and storage devices. This globally valid on both the metadata and storage devices. This
requires the metadata server to propagate changes in lock and open requires the metadata server to propagate changes in lock and open
state to the storage devices, so that the storage devices can state to the storage devices, so that the storage devices can
validate I/O accesses. This is discussed further in Section 12.4.4. validate I/O accesses. This is discussed further in Section 12.4.5.
Depending on when stateids are propagated, the existence of a valid Depending on when stateids are propagated, the existence of a valid
stateid on the storage device may act as proof of a valid layout. stateid on the storage device may act as proof of a valid layout.
[NOTE: a number of proposals have been made that have the possibility [NOTE: a number of proposals have been made that have the possibility
of limiting the amount of validation performed by the storage device, of limiting the amount of validation performed by the storage device,
if any of these proposals are accepted or obtain consensus, the if any of these proposals are accepted or obtain consensus, the
global stateid requirement can be revisited.] global stateid requirement can be revisited.]
12.4.3. The Layout Iomode 12.4.4. The Layout Iomode
The layout iomode need not be used by the metadata server when servicing The layout iomode need not be used by the metadata server when servicing
NFSv4 file-based layouts, although in some circumstances it may be NFSv4.1 file-based layouts, although in some circumstances it may be
useful to use. For example, if the server implementation supports useful to use. For example, if the server implementation supports
reading from read-only replicas or mirrors, it would be useful for reading from read-only replicas or mirrors, it would be useful for
the server to return a layout enabling the client to do so. As such, the server to return a layout enabling the client to do so. As such,
the client should set the iomode based on its intent to read or write the client should set the iomode based on its intent to read or write
the data. The client may default to an iomode of READ/WRITE the data. The client may default to an iomode of READ/WRITE
(LAYOUTIOMODE_RW). The iomode need not be checked by the storage (LAYOUTIOMODE_RW). The iomode need not be checked by the storage
devices when clients perform I/O. However, the storage devices SHOULD devices when clients perform I/O. However, the storage devices SHOULD
still validate that the client holds a valid layout and return an still validate that the client holds a valid layout and return an
error if the client does not. error if the client does not.
12.4.4. Storage Device State Propagation 12.4.5. Storage Device State Propagation
Since the metadata server, which handles lock and open-mode state Since the metadata server, which handles lock and open-mode state
changes, as well as ACLs, may not be co-located with the storage changes, as well as ACLs, may not be co-located with the storage
devices where I/O accesses are validated, the server devices where I/O accesses are validated, the server
implementation MUST take care of propagating changes of this state to implementation MUST take care of propagating changes of this state to
the storage devices. Once the propagation to the storage devices is the storage devices. Once the propagation to the storage devices is
complete, the full effect of those changes must be in effect at the complete, the full effect of those changes must be in effect at the
storage devices. However, some state changes need not be propagated storage devices. However, some state changes need not be propagated
immediately, although all changes SHOULD be propagated promptly. immediately, although all changes SHOULD be propagated promptly.
These state propagations have an impact on the design of the control These state propagations have an impact on the design of the control
protocol, even though the control protocol is outside of the scope of protocol, even though the control protocol is outside of the scope of
this specification. Immediate propagation refers to the synchronous this specification. Immediate propagation refers to the synchronous
propagation of state from the metadata server to the storage propagation of state from the metadata server to the storage
device(s); the propagation must be complete before returning to the device(s); the propagation must be complete before returning to the
client. client.
12.4.4.1. Lock State Propagation 12.4.5.1. Lock State Propagation
Mandatory locks MUST be made effective at the storage devices before Mandatory locks MUST be made effective at the storage devices before
the request that establishes them returns to the caller. Thus, the request that establishes them returns to the caller. Thus,
mandatory lock state MUST be synchronously propagated to the storage mandatory lock state MUST be synchronously propagated to the storage
devices. On the other hand, since advisory lock state is not used devices. On the other hand, since advisory lock state is not used
for checking I/O accesses at the storage devices, there is no for checking I/O accesses at the storage devices, there is no
semantic reason for propagating advisory lock state to the storage semantic reason for propagating advisory lock state to the storage
devices. However, since all lock, unlock, open downgrades and devices. However, since all lock, unlock, open downgrades and
upgrades affect the sequence ID stored within the stateid, the upgrades affect the sequence ID stored within the stateid, the
stateid changes which may cause difficulty if this state is not stateid changes which may cause difficulty if this state is not
skipping to change at page 238, line 7 skipping to change at page 247, line 40
Since updates to advisory locks neither confer nor remove privileges, Since updates to advisory locks neither confer nor remove privileges,
these changes need not be propagated immediately, and may not need to these changes need not be propagated immediately, and may not need to
be propagated promptly. The updates to advisory locks need only be be propagated promptly. The updates to advisory locks need only be
propagated when the storage device needs to resolve a question about propagated when the storage device needs to resolve a question about
a stateid. In fact, if byte-range locking is not mandatory (i.e., is a stateid. In fact, if byte-range locking is not mandatory (i.e., is
advisory) the clients are advised not to use the lock-based stateids advisory) the clients are advised not to use the lock-based stateids
for I/O at all. The stateids returned by open are sufficient and for I/O at all. The stateids returned by open are sufficient and
eliminate overhead for this kind of state propagation. eliminate overhead for this kind of state propagation.
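The advice on which stateid to present for I/O can be sketched as follows. The function and parameter names are hypothetical. Using the lock stateid when locking is mandatory is an assumption made for this illustration; the text above only states the advisory case.

```python
from typing import Optional


def stateid_for_io(
    open_stateid: str,
    lock_stateid: Optional[str],
    mandatory_locking: bool,
) -> str:
    """Choose the stateid to send with READ/WRITE to a storage device.

    With advisory locking, the open stateid is sufficient and avoids
    lock-state propagation overhead. With mandatory locking (assumption),
    prefer the lock stateid when one is held.
    """
    if mandatory_locking and lock_stateid is not None:
        return lock_stateid
    return open_stateid
```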
12.4.4.2. Open-mode Validation 12.4.5.2. Open-mode Validation
Open-mode validation MUST be performed against the open mode(s) held Open-mode validation MUST be performed against the open mode(s) held
by the storage devices. However, the server implementation may not by the storage devices. However, the server implementation may not
always require the immediate propagation of changes. Reductions in always require the immediate propagation of changes. Reductions in
access because of CLOSEs or DOWNGRADEs do not have to be propagated access because of CLOSEs or DOWNGRADEs do not have to be propagated
immediately, but SHOULD be propagated promptly; whereas changes due immediately, but SHOULD be propagated promptly; whereas changes due
to revocation MUST be propagated immediately. On the other hand, to revocation MUST be propagated immediately. On the other hand,
changes that expand access (e.g., new OPENs and upgrades) don't have changes that expand access (e.g., new OPENs and upgrades) don't have
to be propagated immediately but the storage device SHOULD NOT reject to be propagated immediately but the storage device SHOULD NOT reject
a request because of mode issues without making sure that the upgrade a request because of mode issues without making sure that the upgrade
is not in flight. is not in flight.
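The propagation rules in this paragraph can be summarised as a lookup table. The enum, the event names, and the helper below are illustrative assumptions rather than protocol elements.

```python
from enum import Enum


class Propagation(Enum):
    IMMEDIATE = "MUST be propagated immediately"
    PROMPT = "need not be immediate, but SHOULD be propagated promptly"
    DEFERRED = "need not be propagated immediately"


# Hypothetical event names mapped to the requirement stated in the text.
OPEN_MODE_RULES = {
    "close": Propagation.PROMPT,
    "downgrade": Propagation.PROMPT,
    "revocation": Propagation.IMMEDIATE,
    "open": Propagation.DEFERRED,
    "upgrade": Propagation.DEFERRED,
}


def may_reject_for_mode(upgrade_in_flight: bool) -> bool:
    """A storage device SHOULD NOT reject a request on open-mode grounds
    until it has made sure no access-expanding change is still in flight."""
    return not upgrade_in_flight
```

How a storage device learns whether an upgrade is in flight is a control-protocol matter and is outside the scope of this sketch.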
12.4.4.3. File Attributes 12.4.5.3. File Attributes
Since the SETATTR operation has the ability to modify state that is Since the SETATTR operation has the ability to modify state that is
visible on both the metadata and storage devices (e.g., the size), visible on both the metadata and storage devices (e.g., the size),
care must be taken to ensure that the resultant state across the set care must be taken to ensure that the resultant state across the set
of storage devices is consistent; especially when truncating or of storage devices is consistent; especially when truncating or
growing the file. growing the file.
As described earlier, the LAYOUTCOMMIT operation is used to ensure As described earlier, the LAYOUTCOMMIT operation is used to ensure
that the metadata is synced with changes made to the storage devices. that the metadata is synced with changes made to the storage devices.
For the file-based protocol, it is necessary to re-sync state such as For the file-based protocol, it is necessary to re-sync state such as
skipping to change at page 239, line 13 skipping to change at page 248, line 45
appropriate access checking on the READs and WRITEs themselves. appropriate access checking on the READs and WRITEs themselves.
This also includes changes to ACLs. The propagation of access right This also includes changes to ACLs. The propagation of access right
changes due to changes in ACLs may be asynchronous only if the server changes due to changes in ACLs may be asynchronous only if the server
implementation is able to determine that the updated ACL is not more implementation is able to determine that the updated ACL is not more
restrictive for any user specified in the old ACL. Due to the restrictive for any user specified in the old ACL. Due to the
relative infrequency of ACL updates, it is suggested that all changes relative infrequency of ACL updates, it is suggested that all changes
be propagated synchronously. be propagated synchronously.
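The condition for asynchronous ACL propagation can be sketched with a deliberately simplified model. Here an ACL is assumed to be a mapping from user name to a set of granted permissions. Real NFSv4 ACLs are ordered lists of allow and deny entries, so this shows only the shape of the check.

```python
from typing import Dict, FrozenSet

# Simplified, hypothetical ACL model: user name -> granted permissions.
SimpleAcl = Dict[str, FrozenSet[str]]


def may_propagate_async(old_acl: SimpleAcl, new_acl: SimpleAcl) -> bool:
    """Return True only if the new ACL is not more restrictive for any user
    named in the old ACL, i.e. every such user keeps all old permissions."""
    return all(
        perms <= new_acl.get(user, frozenset())
        for user, perms in old_acl.items()
    )
```

Given the suggestion above that all ACL changes be propagated synchronously, a server could skip this check entirely and always use the synchronous path.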
[NOTE: it has been suggested that the NFSv4 specification is in error [NOTE: it has been suggested that the NFSv4 specification is in error
with regard to allowing principles other than those used for OPEN to with regard to allowing principals other than those used for OPEN to
be used for file I/O. If changes within a minor version alter the be used for file I/O. If changes within a minor version alter the
behavior of NFSv4 with regard to OPEN principals and stateids some behavior of NFSv4 with regard to OPEN principals and stateids some
access control checking at the storage device can be made less access control checking at the storage device can be made less
expensive. pNFS should be altered to take full advantage of these expensive. pNFS should be altered to take full advantage of these
changes.] changes.]
12.4.5. Storage Device Component File Size 12.4.6. Storage Device Component File Size
A potential problem exists when a component data file on a particular A potential problem exists when a component data file on a particular
storage device is grown past EOF; the problem exists for both dense storage device is grown past EOF; the problem exists for both dense
and sparse layouts. Imagine the following scenario: a client creates and sparse layouts. Imagine the following scenario: a client creates
a new file (size == 0) and writes to byte 128KB; the client then a new file (size == 0) and writes to byte 128KB; the client then
seeks to the beginning of the file and reads byte 100. The client seeks to the beginning of the file and reads byte 100. The client
should receive 0s back as a result of the read. However, if the read should receive 0s back as a result of the read. However, if the read
falls on a different storage device to the client's original write, falls on a different storage device to the client's original write,
the storage device servicing the READ may still believe that the the storage device servicing the READ may still believe that the
file's size is at 0 and return no data with the EOF flag set. The file's size is at 0 and return no data with the EOF flag set. The
skipping to change at page 240, line 5 skipping to change at page 249, line 35
The NFS protocol only provides close-to-open file data cache The NFS protocol only provides close-to-open file data cache
semantics; meaning that when the file is closed all modified data is semantics; meaning that when the file is closed all modified data is
written to the NFS server. When a subsequent open of the file is written to the NFS server. When a subsequent open of the file is
done, the change time is inspected for a difference from a cached done, the change time is inspected for a difference from a cached
value for the change time. For the case above, this means that a value for the change time. For the case above, this means that a
LAYOUTCOMMIT will be done at close (along with the data writes) and LAYOUTCOMMIT will be done at close (along with the data writes) and
will update the file's size and change time. Access from another will update the file's size and change time. Access from another
client after that point will result in the appropriate size being client after that point will result in the appropriate size being
returned. returned.
12.4.6. Crash Recovery Considerations 12.4.7. Crash Recovery Considerations
As described in Section 12.3.7, the layout type specific storage As described in Section 12.3.7, the layout type specific storage
protocol is responsible for handling the effects of I/Os started protocol is responsible for handling the effects of I/Os started
before lease expiration, extending through lease expiration. The before lease expiration, extending through lease expiration. The
NFSv4 file layout type prevents all I/Os from being executed after NFSv4.1 file layout type prevents all I/Os from being executed after
lease expiration, without relying on a precise client lease timer and lease expiration, without relying on a precise client lease timer and
without requiring storage devices to maintain lease timers. without requiring storage devices to maintain lease timers.
It works as follows. In the presence of sessions, each compound It works as follows. In the presence of sessions, each compound
begins with a SEQUENCE operation that contains the "clientID". On begins with a SEQUENCE operation that contains the "clientID". On
the storage device, the clientID can be used to validate that the the storage device, the clientID can be used to validate that the
client has a valid layout for the I/O being performed; if it does client has a valid layout for the I/O being performed; if it does
not, the I/O is rejected. Before the metadata server takes any not, the I/O is rejected. Before the metadata server takes any
action to invalidate a layout given out by a previous instance, it action to invalidate a layout given out by a previous instance, it
must make sure that all layouts from that previous instance are must make sure that all layouts from that previous instance are
skipping to change at page 240, line 34 skipping to change at page 250, line 15
the layout itself must be invalidated. the layout itself must be invalidated.
This means that a metadata server may not restripe a file until it This means that a metadata server may not restripe a file until it
has contacted all of the storage devices to invalidate the layouts has contacted all of the storage devices to invalidate the layouts
from the previous instance nor may it give out locks that conflict from the previous instance nor may it give out locks that conflict
with locks embodied by the stateids associated with any layout from with locks embodied by the stateids associated with any layout from
the previous instance without either doing a specific invalidation the previous instance without either doing a specific invalidation
(as it would have to do anyway) or doing a global storage device (as it would have to do anyway) or doing a global storage device
invalidation. invalidation.
12.4.7. Security Considerations for the File Layout Type 12.4.8. Security Considerations for the File Layout Type
The NFSv4 file layout type MUST adhere to the security considerations The NFSv4.1 file layout type MUST adhere to the security
outlined in Section 12.3.8. More specifically, storage devices must considerations outlined in Section 12.3.8. More specifically,
make all of the required access checks on each READ or WRITE I/O as storage devices must make all of the required access checks on each
determined by the NFSv4 protocol RFC3530 [2]. If the metadata server READ or WRITE I/O as determined by the [[Comment.15: get rid of
would deny an operation on a given file due to its ACL, mode attribute, references to RFC3530]]NFSv4 protocol RFC3530 [2]. If the metadata
open mode, open deny mode, mandatory lock state, or any other server would deny an operation on a given file due to its ACL, mode
attributes and state, the data server MUST also deny the operation. attribute, open mode, open deny mode, mandatory lock state, or any
This impacts the control protocol and the propagation of state from other attributes and state, the data server MUST also deny the
the metadata server to the storage devices; see Section 12.4.4 for operation. This impacts the control protocol and the propagation of
more details. state from the metadata server to the storage devices; see
Section 12.4.5 for more details.
The methods for authentication, integrity, and privacy for file The methods for authentication, integrity, and privacy for file
layout-based data servers are the same as those used for metadata layout-based data servers are the same as those used for metadata
servers. Metadata and data servers use ONC RPC security flavors to servers. Metadata and data servers use ONC RPC security flavors to
authenticate, and SECINFO and SECINFO_NO_NAME to negotiate the authenticate, and SECINFO and SECINFO_NO_NAME to negotiate the
security mechanism and services to be used. security mechanism and services to be used.
For a given file object, a metadata server MAY require different For a given file object, a metadata server MAY require different
security parameters (secinfo4 value) than the data server. For a security parameters (secinfo4 value) than the data server. For a
given file object with multiple data servers, the secinfo4 value given file object with multiple data servers, the secinfo4 value
SHOULD be the same across all data servers. SHOULD be the same across all data servers.
If an NFSv4.1 implementation supports parallel NFS and supports file If an NFSv4.1 implementation supports parallel NFS and supports file
layouts, then the implementation MUST support the SECINFO_NO_NAME layouts, then the implementation MUST support the SECINFO_NO_NAME
operation, on both the metadata and data servers. operation, on both the metadata and data servers.
12.4.8. Alternate Approaches 12.4.9. Alternate Approaches
Two alternate approaches exist for file-based layouts and the method Two alternate approaches exist for file-based layouts and the method
used by clients to obtain stateids used for I/O. Both approaches used by clients to obtain stateids used for I/O. Both approaches
embed stateids within the layout. embed stateids within the layout.
However, before examining these approaches it is important to However, before examining these approaches it is important to
understand the distinction between clients and owners. Delegations understand the distinction between clients and owners. Delegations
belong to clients, while locks (e.g., record and share reservations) belong to clients, while locks (e.g., record and share reservations)
are held by owners which in turn belong to a specific client. As are held by owners which in turn belong to a specific client. As
such, delegations can only protect against inter-client conflicts, such, delegations can only protect against inter-client conflicts,
skipping to change at page 242, line 12 skipping to change at page 251, line 43
layout is a special stateid of all zeros, then the stateid referring layout is a special stateid of all zeros, then the stateid referring
to the last successful OPEN/LOCK should be used. This approach is to the last successful OPEN/LOCK should be used. This approach is
recommended if it is decided that using NFSv4 as a control protocol recommended if it is decided that using NFSv4 as a control protocol
is required. is required.
This proposal suggests the global stateid approach due to the cleaner This proposal suggests the global stateid approach due to the cleaner
semantics it provides regarding the relationship between stateids semantics it provides regarding the relationship between stateids
used for I/O and their corresponding open instance or lock state. used for I/O and their corresponding open instance or lock state.
However, it does have a profound impact on the control protocol's However, it does have a profound impact on the control protocol's
implementation and the state propagation that is required (as implementation and the state propagation that is required (as
described in Section 12.4.4). described in Section 12.4.5).
13. Internationalization 13. Internationalization
The primary area in which NFS version 4 needs to deal with The primary area in which NFS version 4 needs to deal with
internationalization, or I18N, is with respect to file names and internationalization, or I18N, is with respect to file names and
other strings as used within the protocol. The choice of string other strings as used within the protocol. The choice of string
representation must allow reasonable name/string access to clients representation must allow reasonable name/string access to clients
which use various languages. The UTF-8 encoding of the UCS as which use various languages. The UTF-8 encoding of the UCS as
defined by ISO10646 [10] allows for this type of access and follows defined by ISO10646 [10] allows for this type of access and follows
the policy described in "IETF Policy on Character Sets and the policy described in "IETF Policy on Character Sets and
skipping to change at page 243, line 17 skipping to change at page 252, line 48
o The tables from stringprep listing of characters that are o The tables from stringprep listing of characters that are
prohibited as output (as described in section 5 of stringprep) prohibited as output (as described in section 5 of stringprep)
o The bidirectional string testing used, if any (as described in o The bidirectional string testing used, if any (as described in
section 6 of stringprep) section 6 of stringprep)
o Any additional characters that are prohibited as output specific o Any additional characters that are prohibited as output specific
to the profile to the profile
Stringprep discusses Unicode characters, whereas NFS version 4 Stringprep discusses Unicode characters, whereas NFS version 4
renders UTF-8 characters. Since there is a one to one mapping from renders UTF-8 characters. Since there is a one-to-one mapping from
UTF-8 to Unicode, where ever the remainder of this document refers to UTF-8 to Unicode, when the remainder of this document refers to
to Unicode, the reader should assume UTF-8. Unicode, the reader should assume UTF-8.
Much of the text for the profiles comes from RFC3491 [13]. Much of the text for the profiles comes from RFC3491 [13].
13.1. Stringprep profile for the utf8str_cs type 13.1. Stringprep profile for the utf8str_cs type
Every use of the utf8str_cs type definition in the NFS version 4 Every use of the utf8str_cs type definition in the NFS version 4
protocol specification follows the profile named nfs4_cs_prep. protocol specification follows the profile named nfs4_cs_prep.
13.1.1. Intended applicability of the nfs4_cs_prep profile 13.1.1. Intended applicability of the nfs4_cs_prep profile
skipping to change at page 250, line 45 skipping to change at page 260, line 22
| | | (either current or | | | | (either current or |
| | | superseded) for a | | | | superseded) for a |
| | | current | | | | current |
| | | lockowner-file pair, | | | | lockowner-file pair, |
| | | was used. | | | | was used. |
| NFS4ERR_BADXDR | 10036 | The server | | NFS4ERR_BADXDR | 10036 | The server |
| | | encountered an XDR | | | | encountered an XDR |
| | | decoding error while | | | | decoding error while |
| | | processing an | | | | processing an |
| | | operation. | | | | operation. |
| NFS4ERR_CLID_INUSE | 10017 | The CREATE_CLIENTID | | NFS4ERR_CLID_INUSE | 10017 | The EXCHANGE_ID |
| | | operation has found | | | | operation has found |
| | | that a client id is | | | | that a client id is |
| | | already in use by | | | | already in use by |
| | | another client. | | | | another client. |
| NFS4ERR_COMPLETE_ALREADY | 10054 | A RECLAIM_COMPLETE | | NFS4ERR_COMPLETE_ALREADY | 10054 | A RECLAIM_COMPLETE |
| | | operation was done | | | | operation was done |
| | | by a client which | | | | by a client which |
| | | had already | | | | had already |
| | | performed one. | | | | performed one. |
| NFS4ERR_CONN_NOT_BOUND_TO_SESSION | 10055 | The connection is | | NFS4ERR_CONN_NOT_BOUND_TO_SESSION | 10055 | The connection is |
skipping to change at page 261, line 35 skipping to change at page 270, line 35
| | | limit set by the | | | | limit set by the |
| | | initial request. | | | | initial request. |
| NFS4ERR_TOO_MANY_OPS | 10070 | The COMPOUND or | | NFS4ERR_TOO_MANY_OPS | 10070 | The COMPOUND or |
| | | CB_COMPOUND request | | | | CB_COMPOUND request |
| | | has too many | | | | has too many |
| | | operations. | | | | operations. |
| NFS4ERR_UNKNOWN_LAYOUTTYPE | 10062 | Layout type is | | NFS4ERR_UNKNOWN_LAYOUTTYPE | 10062 | Layout type is |
| | | unknown. | | | | unknown. |
| NFS4ERR_UNSAFE_COMPOUND | 10069 | The client has sent | | NFS4ERR_UNSAFE_COMPOUND | 10069 | The client has sent |
| | | a COMPOUND request | | | | a COMPOUND request |
| | | with an usafe mix of | | | | with an unsafe mix |
| | | operations. | | | | of operations. |
| NFS4ERR_WRONGSEC | 10016 | The security | | NFS4ERR_WRONGSEC | 10016 | The security |
| | | mechanism being used | | | | mechanism being used |
| | | by the client for | | | | by the client for |
| | | the operation does | | | | the operation does |
| | | not match the | | | | not match the |
| | | server's security | | | | server's security |
| | | policy. The client | | | | policy. The client |
| | | should change the | | | | should change the |
| | | security mechanism | | | | security mechanism |
| | | being used and retry | | | | being used and retry |
skipping to change at page 263, line 19 skipping to change at page 272, line 19
| | NFS4ERR_EXIST, NFS4ERR_FHEXPIRED, | | | NFS4ERR_EXIST, NFS4ERR_FHEXPIRED, |
| | NFS4ERR_INVAL, NFS4ERR_IO, NFS4ERR_MOVED, | | | NFS4ERR_INVAL, NFS4ERR_IO, NFS4ERR_MOVED, |
| | NFS4ERR_NAMETOOLONG, NFS4ERR_NOFILEHANDLE, | | | NFS4ERR_NAMETOOLONG, NFS4ERR_NOFILEHANDLE, |
| | NFS4ERR_NOSPC, NFS4ERR_NOTDIR, | | | NFS4ERR_NOSPC, NFS4ERR_NOTDIR, |
| | NFS4ERR_OP_NOT_IN_SESSION, NFS4ERR_PERM, | | | NFS4ERR_OP_NOT_IN_SESSION, NFS4ERR_PERM, |
| | NFS4ERR_REQ_TOO_BIG, NFS4ERR_TOO_MANY_OPS, | | | NFS4ERR_REQ_TOO_BIG, NFS4ERR_TOO_MANY_OPS, |
| | NFS4ERR_REP_TOO_BIG, | | | NFS4ERR_REP_TOO_BIG, |
| | NFS4ERR_REP_TOO_BIG_TO_CACHE, | | | NFS4ERR_REP_TOO_BIG_TO_CACHE, |
| | NFS4ERR_UNSAFE_COMPOUND, NFS4ERR_ROFS, | | | NFS4ERR_UNSAFE_COMPOUND, NFS4ERR_ROFS, |
| | NFS4ERR_SERVERFAULT, NFS4ERR_STALE | | | NFS4ERR_SERVERFAULT, NFS4ERR_STALE |
| CREATE_CLIENTID | | | EXCHANGE_ID | |
| CREATE_SESSION | NFS4ERR_BADXDR, NFS4ERR_CLID_INUSE, | | CREATE_SESSION | NFS4ERR_BADXDR, NFS4ERR_CLID_INUSE, |
| | NFS4ERR_SERVERFAULT, | | | NFS4ERR_SERVERFAULT, |
| | NFS4ERR_STALE_CLIENTID | | | NFS4ERR_STALE_CLIENTID |
| DELEGPURGE | NFS4ERR_BADXDR, NFS4ERR_NOTSUPP, | | DELEGPURGE | NFS4ERR_BADXDR, NFS4ERR_NOTSUPP, |
| | NFS4ERR_LEASE_MOVED, NFS4ERR_MOVED, | | | NFS4ERR_LEASE_MOVED, NFS4ERR_MOVED, |
| | NFS4ERR_OP_NOT_IN_SESSION, | | | NFS4ERR_OP_NOT_IN_SESSION, |
| | NFS4ERR_REQ_TOO_BIG, NFS4ERR_TOO_MANY_OPS, | | | NFS4ERR_REQ_TOO_BIG, NFS4ERR_TOO_MANY_OPS, |
| | NFS4ERR_REP_TOO_BIG, | | | NFS4ERR_REP_TOO_BIG, |
| | NFS4ERR_REP_TOO_BIG_TO_CACHE, | | | NFS4ERR_REP_TOO_BIG_TO_CACHE, |
| | NFS4ERR_UNSAFE_COMPOUND, | | | NFS4ERR_UNSAFE_COMPOUND, |
skipping to change at page 273, line 22 skipping to change at page 282, line 22
| | NFS4ERR_LOCKED, NFS4ERR_MOVED, | | | NFS4ERR_LOCKED, NFS4ERR_MOVED, |
| | NFS4ERR_NOFILEHANDLE, NFS4ERR_NOSPC, | | | NFS4ERR_NOFILEHANDLE, NFS4ERR_NOSPC, |
| | NFS4ERR_OLD_STATEID, NFS4ERR_OPENMODE, | | | NFS4ERR_OLD_STATEID, NFS4ERR_OPENMODE, |
| | NFS4ERR_PERM, NFS4ERR_OP_NOT_IN_SESSION, | | | NFS4ERR_PERM, NFS4ERR_OP_NOT_IN_SESSION, |
| | NFS4ERR_REQ_TOO_BIG, NFS4ERR_TOO_MANY_OPS, | | | NFS4ERR_REQ_TOO_BIG, NFS4ERR_TOO_MANY_OPS, |
| | NFS4ERR_REP_TOO_BIG, | | | NFS4ERR_REP_TOO_BIG, |
| | NFS4ERR_REP_TOO_BIG_TO_CACHE, | | | NFS4ERR_REP_TOO_BIG_TO_CACHE, |
| | NFS4ERR_UNSAFE_COMPOUND, NFS4ERR_ROFS, | | | NFS4ERR_UNSAFE_COMPOUND, NFS4ERR_ROFS, |
| | NFS4ERR_SERVERFAULT, NFS4ERR_STALE, | | | NFS4ERR_SERVERFAULT, NFS4ERR_STALE, |
| | NFS4ERR_STALE_STATEID | | | NFS4ERR_STALE_STATEID |
| CREATE_CLIENTID | NFS4ERR_BADXDR, NFS4ERR_CLID_INUSE, | | EXCHANGE_ID | NFS4ERR_BADXDR, NFS4ERR_CLID_INUSE, |
| | NFS4ERR_INVAL, NFS4ERR_SERVERFAULT | | | NFS4ERR_INVAL, NFS4ERR_SERVERFAULT |
| CREATE_SESSION | NFS4ERR_BADXDR, NFS4ERR_CLID_INUSE, | | CREATE_SESSION | NFS4ERR_BADXDR, NFS4ERR_CLID_INUSE, |
| | NFS4ERR_DELAY, NFS4ERR_SERVERFAULT, | | | NFS4ERR_DELAY, NFS4ERR_SERVERFAULT, |
| | NFS4ERR_STALE_CLIENTID | | | NFS4ERR_STALE_CLIENTID |
| VERIFY | NFS4ERR_ACCESS, NFS4ERR_ATTRNOTSUPP, | | VERIFY | NFS4ERR_ACCESS, NFS4ERR_ATTRNOTSUPP, |
| | NFS4ERR_BADCHAR, NFS4ERR_BADHANDLE, | | | NFS4ERR_BADCHAR, NFS4ERR_BADHANDLE, |
| | NFS4ERR_BADXDR, NFS4ERR_DELAY, | | | NFS4ERR_BADXDR, NFS4ERR_DELAY, |
| | NFS4ERR_FHEXPIRED, NFS4ERR_INVAL, | | | NFS4ERR_FHEXPIRED, NFS4ERR_INVAL, |
| | NFS4ERR_MOVED, NFS4ERR_NOFILEHANDLE, | | | NFS4ERR_MOVED, NFS4ERR_NOFILEHANDLE, |
| | NFS4ERR_NOT_SAME, | | | NFS4ERR_NOT_SAME, |
skipping to change at page 277, line 7 skipping to change at page 286, line 7
| NFS4ERR_BADNAME | CREATE, LINK, LOOKUP, OPEN, | | NFS4ERR_BADNAME | CREATE, LINK, LOOKUP, OPEN, |
| | REMOVE, RENAME, SECINFO, | | | REMOVE, RENAME, SECINFO, |
| | SECINFO_NO_NAME | | | SECINFO_NO_NAME |
| NFS4ERR_BADOWNER | CREATE, OPEN, SETATTR | | NFS4ERR_BADOWNER | CREATE, OPEN, SETATTR |
| NFS4ERR_BADSESSION | CB_SEQUENCE, SEQUENCE | | NFS4ERR_BADSESSION | CB_SEQUENCE, SEQUENCE |
| NFS4ERR_BADSLOT | CB_SEQUENCE, SEQUENCE | | NFS4ERR_BADSLOT | CB_SEQUENCE, SEQUENCE |
| NFS4ERR_BADTYPE | CREATE | | NFS4ERR_BADTYPE | CREATE |
| NFS4ERR_BADXDR | ACCESS, CB_GETATTR, | | NFS4ERR_BADXDR | ACCESS, CB_GETATTR, |
| | CB_NOTIFY, CB_RECALL, CLOSE, | | | CB_NOTIFY, CB_RECALL, CLOSE, |
| | COMMIT, CREATE, | | | COMMIT, CREATE, |
| | CREATE_CLIENTID, |
| | CREATE_SESSION, DELEGPURGE, | | | CREATE_SESSION, DELEGPURGE, |
| | DELEGRETURN, GETATTR, | | | DELEGRETURN, EXCHANGE_ID, |
| | GET_DIR_DELEGATION, LINK, | | | GETATTR, GET_DIR_DELEGATION, |
| | LOCK, LOCKT, LOCKU, LOOKUP, | | | LINK, LOCK, LOCKT, LOCKU, |
| | NVERIFY, OPEN, OPENATTR, | | | LOOKUP, NVERIFY, OPEN, |
| | OPEN_DOWNGRADE, PUTFH, READ, | | | OPENATTR, OPEN_DOWNGRADE, |
| | READDIR, RELEASE_LOCKOWNER, | | | PUTFH, READ, READDIR, |
| | REMOVE, RENAME, SECINFO, | | | RELEASE_LOCKOWNER, REMOVE, |
| | RENAME, SECINFO, |
| | SECINFO_NO_NAME, SETATTR, | | | SECINFO_NO_NAME, SETATTR, |
| | VERIFY, WRITE | | | VERIFY, WRITE |
| NFS4ERR_BAD_COOKIE | GETDEVICELIST, READDIR | | NFS4ERR_BAD_COOKIE | GETDEVICELIST, READDIR |
| NFS4ERR_BAD_RANGE | LOCK, LOCKT, LOCKU | | NFS4ERR_BAD_RANGE | LOCK, LOCKT, LOCKU |
| NFS4ERR_BAD_SEQID | CLOSE, LOCK, LOCKU, OPEN, | | NFS4ERR_BAD_SEQID | CLOSE, LOCK, LOCKU, OPEN, |
| | OPEN_DOWNGRADE | | | OPEN_DOWNGRADE |
| NFS4ERR_BAD_SESSION_DIGEST | BIND_CONN_TO_SESSION, SET_SSV | | NFS4ERR_BAD_SESSION_DIGEST | BIND_CONN_TO_SESSION, SET_SSV |
| NFS4ERR_BAD_STATEID | CB_NOTIFY, CB_RECALL, CLOSE, | | NFS4ERR_BAD_STATEID | CB_NOTIFY, CB_RECALL, CLOSE, |
| | DELEGRETURN, LOCK, LOCKU, | | | DELEGRETURN, LOCK, LOCKU, |
| | OPEN_DOWNGRADE, READ, | | | OPEN_DOWNGRADE, READ, |
| | SETATTR, WRITE | | | SETATTR, WRITE |
| NFS4ERR_CLID_INUSE | CREATE_CLIENTID, | | NFS4ERR_CLID_INUSE | CREATE_SESSION, EXCHANGE_ID |
| | CREATE_SESSION |
| NFS4ERR_COMPLETE_ALREADY | RECLAIM_COMPLETE | | NFS4ERR_COMPLETE_ALREADY | RECLAIM_COMPLETE |
| NFS4ERR_CONN_BINDING_NOT_ENFORCED | BIND_CONN_TO_SESSION, SET_SSV | | NFS4ERR_CONN_BINDING_NOT_ENFORCED | BIND_CONN_TO_SESSION, SET_SSV |
| NFS4ERR_CONN_NOT_BOUND_TO_SESSION | CB_SEQUENCE, SEQUENCE | | NFS4ERR_CONN_NOT_BOUND_TO_SESSION | CB_SEQUENCE, SEQUENCE |
| NFS4ERR_DEADLOCK | LOCK | | NFS4ERR_DEADLOCK | LOCK |
| NFS4ERR_DELAY | ACCESS, CLOSE, CREATE, | | NFS4ERR_DELAY | ACCESS, CLOSE, CREATE, |
| | CREATE_SESSION, GETATTR, | | | CREATE_SESSION, GETATTR, |
| | LINK, LOCK, LOCKT, NVERIFY, | | | LINK, LOCK, LOCKT, NVERIFY, |
| | OPEN, OPENATTR, READ, | | | OPEN, OPENATTR, READ, |
| | READDIR, READLINK, REMOVE, | | | READDIR, READLINK, REMOVE, |
| | RENAME, SETATTR, VERIFY, | | | RENAME, SETATTR, VERIFY, |
skipping to change at page 278, line 23 skipping to change at page 287, line 23
| | PUTFH, READ, READDIR, | | | PUTFH, READ, READDIR, |
| | READLINK, REMOVE, RENAME, | | | READLINK, REMOVE, RENAME, |
| | RESTOREFH, SAVEFH, SECINFO, | | | RESTOREFH, SAVEFH, SECINFO, |
| | SECINFO_NO_NAME, SETATTR, | | | SECINFO_NO_NAME, SETATTR, |
| | VERIFY, WRITE | | | VERIFY, WRITE |
| NFS4ERR_FILE_OPEN | LINK, REMOVE, RENAME | | NFS4ERR_FILE_OPEN | LINK, REMOVE, RENAME |
| NFS4ERR_GRACE | LOCK, LOCKT, LOCKU, OPEN, | | NFS4ERR_GRACE | LOCK, LOCKT, LOCKU, OPEN, |
| | READ, SETATTR, WRITE | | | READ, SETATTR, WRITE |
| NFS4ERR_INVAL | ACCESS, CB_NOTIFY, | | NFS4ERR_INVAL | ACCESS, CB_NOTIFY, |
| | CB_RECALL_ANY, CLOSE, COMMIT, | | | CB_RECALL_ANY, CLOSE, COMMIT, |
| | CREATE, CREATE_CLIENTID, | | | CREATE, DELEGRETURN, |
| | DELEGRETURN, GETATTR, | | | EXCHANGE_ID, GETATTR, |
| | GETDEVICEINFO, GETDEVICELIST, | | | GETDEVICEINFO, GETDEVICELIST, |
| | GET_DIR_DELEGATION, | | | GET_DIR_DELEGATION, |
| | LAYOUTCOMMIT, LAYOUTGET, | | | LAYOUTCOMMIT, LAYOUTGET, |
| | LAYOUTRETURN, LINK, LOCK, | | | LAYOUTRETURN, LINK, LOCK, |
| | LOCKT, LOCKU, LOOKUP, | | | LOCKT, LOCKU, LOOKUP, |
| | NVERIFY, OPEN, | | | NVERIFY, OPEN, |
| | OPEN_DOWNGRADE, READ, | | | OPEN_DOWNGRADE, READ, |
| | READDIR, READLINK, REMOVE, | | | READDIR, READLINK, REMOVE, |
| | RENAME, SECINFO, | | | RENAME, SECINFO, |
| | SECINFO_NO_NAME, SETATTR, | | | SECINFO_NO_NAME, SETATTR, |
skipping to change at page 282, line 7 skipping to change at page 291, line 7
| NFS4ERR_RESTOREFH | RESTOREFH | | NFS4ERR_RESTOREFH | RESTOREFH |
| NFS4ERR_ROFS | COMMIT, CREATE, LINK, OPEN, | | NFS4ERR_ROFS | COMMIT, CREATE, LINK, OPEN, |
| | OPENATTR, REMOVE, RENAME, | | | OPENATTR, REMOVE, RENAME, |
| | SETATTR, WRITE | | | SETATTR, WRITE |
| NFS4ERR_SAME | NVERIFY | | NFS4ERR_SAME | NVERIFY |
| NFS4ERR_SEQUENCE_POS | CB_SEQUENCE, SEQUENCE | | NFS4ERR_SEQUENCE_POS | CB_SEQUENCE, SEQUENCE |
| NFS4ERR_SEQ_MISORDERED | CB_SEQUENCE, SEQUENCE | | NFS4ERR_SEQ_MISORDERED | CB_SEQUENCE, SEQUENCE |
| NFS4ERR_SERVERFAULT | ACCESS, CB_GETATTR, | | NFS4ERR_SERVERFAULT | ACCESS, CB_GETATTR, |
| | CB_NOTIFY, CB_RECALL, CLOSE, | | | CB_NOTIFY, CB_RECALL, CLOSE, |
| | COMMIT, CREATE, | | | COMMIT, CREATE, |
| | CREATE_CLIENTID, |
| | CREATE_SESSION, DELEGPURGE, | | | CREATE_SESSION, DELEGPURGE, |
| | DELEGRETURN, GETATTR, GETFH, | | | DELEGRETURN, EXCHANGE_ID, |
| | GETATTR, GETFH, |
| | GET_DIR_DELEGATION, LINK, | | | GET_DIR_DELEGATION, LINK, |
| | LOCK, LOCKT, LOCKU, LOOKUP, | | | LOCK, LOCKT, LOCKU, LOOKUP, |
| | LOOKUPP, NVERIFY, OPEN, | | | LOOKUPP, NVERIFY, OPEN, |
| | OPENATTR, OPEN_DOWNGRADE, | | | OPENATTR, OPEN_DOWNGRADE, |
| | PUTFH, PUTPUBFH, PUTROOTFH, | | | PUTFH, PUTPUBFH, PUTROOTFH, |
| | READ, READDIR, READLINK, | | | READ, READDIR, READLINK, |
| | RELEASE_LOCKOWNER, REMOVE, | | | RELEASE_LOCKOWNER, REMOVE, |
| | RENAME, RESTOREFH, SAVEFH, | | | RENAME, RESTOREFH, SAVEFH, |
| | SECINFO, SECINFO_NO_NAME, | | | SECINFO, SECINFO_NO_NAME, |
| | SETATTR, VERIFY, WRITE | | | SETATTR, VERIFY, WRITE |
skipping to change at page 286, line 43 skipping to change at page 295, line 43
The definition of the "tag" in the request is left to the The definition of the "tag" in the request is left to the
implementor. It may be used to summarize the content of the compound implementor. It may be used to summarize the content of the compound
request for the benefit of packet sniffers and engineers debugging request for the benefit of packet sniffers and engineers debugging
implementations. However, the value of "tag" in the response SHOULD implementations. However, the value of "tag" in the response SHOULD
be the same value as provided in the request. This applies to the be the same value as provided in the request. This applies to the
tag field of the CB_COMPOUND procedure as well. tag field of the CB_COMPOUND procedure as well.
15.2.4.1. Current File Handle and Stateid 15.2.4.1. Current File Handle and Stateid
The COMPOUND procedure offers a simple environment for the execution The COMPOUND procedure offers a simple environment for the execution
of the operations specified by the clinet. The first two relate to of the operations specified by the client. The first two relate to
the file handle while the second two relate to the current stateid. the file handle while the second two relate to the current stateid.
15.2.4.1.1. Current File Handle 15.2.4.1.1. Current File Handle
The current and saved file handle are used throughout the protocol. The current and saved file handle are used throughout the protocol.
Most operations implicitly use the current file handle as an argument Most operations implicitly use the current file handle as an argument
and many set the current file handle as part of the results. The and many set the current file handle as part of the results. The
combination of client specified sequences of operations and current combination of client specified sequences of operations and current
and saved file handle arguments and results allows for greater and saved file handle arguments and results allows for greater
protocol flexibility. The best or easiest example of current file protocol flexibility. The best or easiest example of current file
skipping to change at page 287, line 16 skipping to change at page 296, line 16
PUTFH fh1 {fh1} PUTFH fh1 {fh1}
LOOKUP "compA" {fh2} LOOKUP "compA" {fh2}
GETATTR {fh2} GETATTR {fh2}
LOOKUP "compB" {fh3} LOOKUP "compB" {fh3}
GETATTR {fh3} GETATTR {fh3}
LOOKUP "compC" {fh4} LOOKUP "compC" {fh4}
GETATTR {fh4} GETATTR {fh4}
GETFH GETFH
Figure 71 Figure 72
In this example, the PUTFH operation explicitly sets the current file In this example, the PUTFH operation explicitly sets the current file
handle value while the result of each LOOKUP operation sets the handle value while the result of each LOOKUP operation sets the
current file handle value to the resultant file system object. Also, current file handle value to the resultant file system object. Also,
the client is able to insert GETATTR operations using the current the client is able to insert GETATTR operations using the current
file handle as an argument. file handle as an argument.
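The way the current file handle threads through the example above can be sketched in a few lines of Python. This is only an illustrative toy model, not part of the protocol: the dictionary FS, the function run_fh_compound, and the handle names fh1 through fh4 are all assumptions made for the sketch, and error handling (a real server stops evaluating at the first failing operation) is reduced to a raised exception.

```python
# Hypothetical toy name space: (directory handle, component name) -> handle.
FS = {
    ("fh1", "compA"): "fh2",
    ("fh2", "compB"): "fh3",
    ("fh3", "compC"): "fh4",
}


def run_fh_compound(ops):
    """Evaluate ops in order, carrying the current file handle between them.

    Each op is a tuple: (name, *arguments). Returns the list of
    (operation, file handle) pairs for operations that report a result.
    """
    current_fh = None
    results = []
    for op, *args in ops:
        if op == "PUTFH":
            # PUTFH explicitly sets the current file handle.
            current_fh = args[0]
        elif op == "LOOKUP":
            # LOOKUP uses the current file handle as its directory and
            # replaces it with the handle of the object it found.
            current_fh = FS[(current_fh, args[0])]
        elif op in ("GETATTR", "GETFH"):
            # These take the current file handle as an implicit argument
            # and leave it unchanged.
            results.append((op, current_fh))
        else:
            raise ValueError("operation not modelled in this sketch: " + op)
    return results
```

Feeding the sketch the operation sequence from the figure, PUTFH fh1 followed by alternating LOOKUP and GETATTR operations and a final GETFH, yields GETATTR results for fh2, fh3, and fh4 and a GETFH result of fh4, matching the annotations in braces.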
Along with the current file handle, there is a saved file handle. Along with the current file handle, there is a saved file handle.
While the current file handle is set as the result of operations like While the current file handle is set as the result of operations like
LOOKUP, the saved file handle must be set directly with the use of LOOKUP, the saved file handle must be set directly with the use of
skipping to change at page 288, line 22 skipping to change at page 297, line 22
current stateid. current stateid.
The following example is the common case of a simple READ operation The following example is the common case of a simple READ operation
with a supplied stateid showing that the PUTFH initializes the with a supplied stateid showing that the PUTFH initializes the
current stateid to zero. The subsequent READ with stateid sid1 current stateid to zero. The subsequent READ with stateid sid1
replaces the current stateid before evaluating the operation. replaces the current stateid before evaluating the operation.
PUTFH fh1 - -> {fh1, 0} PUTFH fh1 - -> {fh1, 0}
READ sid1,0,1024 {fh1, sid1} -> {fh1, sid1} READ sid1,0,1024 {fh1, sid1} -> {fh1, sid1}
Figure 72 Figure 73
This next example performs an OPEN with the client provided stateid This next example performs an OPEN with the client provided stateid
sid1 and as a result generates stateid sid2. The next operation sid1 and as a result generates stateid sid2. The next operation
specifies the READ with the special all-zero stateid but the current specifies the READ with the special all-zero stateid but the current
stateid set by the previous operation is actually used when the stateid set by the previous operation is actually used when the
operation is evaluated, allowing correct interaction with any operation is evaluated, allowing correct interaction with any
existing, potentially conflicting, locks. existing, potentially conflicting, locks.
PUTFH fh1 - -> {fh1, 0} PUTFH fh1 - -> {fh1, 0}
OPEN R,sid1,"compA" {fh1, sid1} -> {fh2, sid2} OPEN R,sid1,"compA" {fh1, sid1} -> {fh2, sid2}
READ 0,0,1024 {fh2, sid2} -> {fh2, sid2} READ 0,0,1024 {fh2, sid2} -> {fh2, sid2}
CLOSE 0 {fh2, sid2} -> {fh2, sid3} CLOSE 0 {fh2, sid2} -> {fh2, sid3}
Figure 73 Figure 74
The final example is similar to the second in how it passes the The final example is similar to the second in how it passes the
stateid sid2 generated by the LOCK operation to the next READ stateid sid2 generated by the LOCK operation to the next READ
operation. This allows the client to explicitly surround a single operation. This allows the client to explicitly surround a single
I/O operation with a lock and its appropriate stateid to guarantee I/O operation with a lock and its appropriate stateid to guarantee
correctness with other client locks. correctness with other client locks.
PUTFH fh1 - -> {fh1, 0} PUTFH fh1 - -> {fh1, 0}
LOCK W,0,1024,sid1 {fh1, sid1} -> {fh1, sid2} LOCK W,0,1024,sid1 {fh1, sid1} -> {fh1, sid2}
READ 0,0,1024 {fh1, sid2} -> {fh1, sid2} READ 0,0,1024 {fh1, sid2} -> {fh1, sid2}
LOCKU W,0,1024,0 {fh1, sid2} -> {fh1, sid3} LOCKU W,0,1024,0 {fh1, sid2} -> {fh1, sid3}
Figure 74 Figure 75
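The current stateid behaviour shown in the three examples above can be approximated with a short Python sketch. It is a simplified reading of those figures only, not a statement of the protocol's full rules: the function run_stateid_compound, the constant ZERO, the set STATE_CHANGING, and the string stateids are hypothetical names, and file handle changes are deliberately left out.

```python
ZERO = 0  # stands in for the special all-zero stateid

# Operations that, in the figures above, hand back a new stateid.
STATE_CHANGING = {"OPEN", "CLOSE", "LOCK", "LOCKU"}


def run_stateid_compound(ops, new_stateids):
    """Return the stateid each operation is actually evaluated with.

    ops          -- list of (name, supplied_stateid); PUTFH supplies None.
    new_stateids -- stateids the toy server returns, in order, for each
                    state-changing operation.
    """
    fresh = iter(new_stateids)
    current = None
    used = []
    for name, supplied in ops:
        if name == "PUTFH":
            # PUTFH initializes the current stateid to zero.
            current = ZERO
            continue
        if supplied != ZERO:
            # A supplied non-zero stateid replaces the current stateid
            # before the operation is evaluated.
            current = supplied
        # With an all-zero supplied stateid, the current stateid is used.
        used.append((name, current))
        if name in STATE_CHANGING:
            # The stateid in the result becomes the new current stateid.
            current = next(fresh)
    return used
```

For the OPEN, READ, CLOSE sequence, the sketch reports that READ and CLOSE are both evaluated with sid2 even though the client supplied the all-zero stateid, which is the point the second and third examples make.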
15.2.5. IMPLEMENTATION 15.2.5. IMPLEMENTATION
15.2.6. ERRORS 15.2.6. ERRORS
All errors defined in the protocol All errors defined in the protocol
16. NFS version 4.1 Operations 16. NFS version 4.1 Operations
16.1. Operation 3: ACCESS - Check Access Rights 16.1. Operation 3: ACCESS - Check Access Rights
skipping to change at page 312, line 37 skipping to change at page 321, line 37
directory, the error NFS4ERR_NOTDIR is returned. directory, the error NFS4ERR_NOTDIR is returned.
If the requester's security flavor does not match that configured for If the requester's security flavor does not match that configured for
the parent directory, then the server SHOULD return NFS4ERR_WRONGSEC the parent directory, then the server SHOULD return NFS4ERR_WRONGSEC
(a future minor revision of NFSv4 may upgrade this to MUST) in the (a future minor revision of NFSv4 may upgrade this to MUST) in the
LOOKUPP response. However, if the server does so, it MUST support LOOKUPP response. However, if the server does so, it MUST support
the new SECINFO_NO_NAME operation, so that the client can gracefully the new SECINFO_NO_NAME operation, so that the client can gracefully
determine the correct security flavor. See the discussion of the determine the correct security flavor. See the discussion of the
SECINFO_NO_NAME operation for a description. SECINFO_NO_NAME operation for a description.
If the current filehandle is a named attribute directory that is
associated with a filesystem object via OPENATTR (i.e. not a sub-
directory of a named attribute directory) LOOKUPP SHOULD return the
filehandle of the associated filesystem object.
16.14.5. IMPLEMENTATION 16.14.5. IMPLEMENTATION