NFSv4                                                         S. Shepler
Internet-Draft                                                    Editor
Intended status: Standards Track                           March 6, 2006
Expires: September 7, 2006

                         NFSv4 Minor Version 1
                 draft-ietf-nfsv4-minorversion1-02.txt
Status of this Memo

By submitting this Internet-Draft, each author represents that any
applicable patent or other IPR claims of which he or she is aware
have been or will be disclosed, and any of which he or she becomes
aware will be disclosed, in accordance with Section 6 of BCP 79.
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups.  Note that
other groups may also distribute working documents as Internet-
Drafts.

Internet-Drafts are draft documents valid for a maximum of six
months and may be updated, replaced, or obsoleted by other documents
at any time.  It is inappropriate to use Internet-Drafts as
reference material or to cite them other than as "work in progress."
The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt.

The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html.
This Internet-Draft will expire on September 7, 2006.
Copyright Notice

Copyright (C) The Internet Society (2006).
Abstract

This Internet-Draft describes the NFSv4 minor version 1 protocol
extensions.  The most significant of these extensions are commonly
called Sessions, Directory Delegations, and parallel NFS (pNFS).
Requirements Language

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in RFC 2119 [1].
Table of Contents
1. Protocol Data Types . . . . . . . . . . . . . . . . . . . . . 9
1.1. Basic Data Types . . . . . . . . . . . . . . . . . . . . 9
1.2. Structured Data Types . . . . . . . . . . . . . . . . . 10
2. Filehandles . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.1. Obtaining the First Filehandle . . . . . . . . . . . . . 19
2.1.1. Root Filehandle . . . . . . . . . . . . . . . . . . 20
2.1.2. Public Filehandle . . . . . . . . . . . . . . . . . 20
2.2. Filehandle Types . . . . . . . . . . . . . . . . . . . . 20
2.2.1. General Properties of a Filehandle . . . . . . . . . 21
2.2.2. Persistent Filehandle . . . . . . . . . . . . . . . 22
2.2.3. Volatile Filehandle . . . . . . . . . . . . . . . . 22
2.3. One Method of Constructing a Volatile Filehandle . . . . 23
2.4. Client Recovery from Filehandle Expiration . . . . . . . 24
3. File Attributes . . . . . . . . . . . . . . . . . . . . . . . 25
3.1. Mandatory Attributes . . . . . . . . . . . . . . . . . . 26
3.2. Recommended Attributes . . . . . . . . . . . . . . . . . 26
3.3. Named Attributes . . . . . . . . . . . . . . . . . . . . 27
3.4. Classification of Attributes . . . . . . . . . . . . . . 27
3.5. Mandatory Attributes - Definitions . . . . . . . . . . . 28
3.6. Recommended Attributes - Definitions . . . . . . . . . . 30
3.7. Time Access . . . . . . . . . . . . . . . . . . . . . . 38
3.8. Interpreting owner and owner_group . . . . . . . . . . . 38
3.9. Character Case Attributes . . . . . . . . . . . . . . . 40
3.10. Quota Attributes . . . . . . . . . . . . . . . . . . . . 40
3.11. mounted_on_fileid . . . . . . . . . . . . . . . . . . . 41
3.12. send_impl_id and recv_impl_id . . . . . . . . . . . . . 42
3.13. fs_layouttype . . . . . . . . . . . . . . . . . . . . . 43
3.14. layouttype . . . . . . . . . . . . . . . . . . . . . . . 43
3.15. layouthint . . . . . . . . . . . . . . . . . . . . . . . 43
3.16. Access Control Lists . . . . . . . . . . . . . . . . . . 43
3.16.1. ACE type . . . . . . . . . . . . . . . . . . . . . . 45
3.16.2. ACE Access Mask . . . . . . . . . . . . . . . . . . 46
3.16.3. ACE flag . . . . . . . . . . . . . . . . . . . . . . 51
3.16.4. ACE who . . . . . . . . . . . . . . . . . . . . . . 53
3.16.5. Mode Attribute . . . . . . . . . . . . . . . . . . . 54
3.16.6. Interaction Between Mode and ACL Attributes . . . . 55
4. Filesystem Migration and Replication . . . . . . . . . . . . 69
4.1. Replication . . . . . . . . . . . . . . . . . . . . . . 69
4.2. Migration . . . . . . . . . . . . . . . . . . . . . . . 70
4.3. Interpretation of the fs_locations Attribute . . . . . . 70
4.4. Filehandle Recovery for Migration or Replication . . . . 72
5. NFS Server Name Space . . . . . . . . . . . . . . . . . . . . 72
5.1. Server Exports . . . . . . . . . . . . . . . . . . . . . 72
5.2. Browsing Exports . . . . . . . . . . . . . . . . . . . . 72
5.3. Server Pseudo Filesystem . . . . . . . . . . . . . . . . 73
5.4. Multiple Roots . . . . . . . . . . . . . . . . . . . . . 73
5.5. Filehandle Volatility . . . . . . . . . . . . . . . . . 74
5.6. Exported Root . . . . . . . . . . . . . . . . . . . . . 74
5.7. Mount Point Crossing . . . . . . . . . . . . . . . . . . 74
5.8. Security Policy and Name Space Presentation . . . . . . 75
6. File Locking and Share Reservations . . . . . . . . . . . . . 76
6.1. Locking . . . . . . . . . . . . . . . . . . . . . . . . 76
6.1.1. Client ID . . . . . . . . . . . . . . . . . . . . . 77
6.1.2. Server Release of Clientid . . . . . . . . . . . . . 79
6.1.3. lock_owner and stateid Definition . . . . . . . . . 80
6.1.4. Use of the stateid and Locking . . . . . . . . . . . 82
6.1.5. Sequencing of Lock Requests . . . . . . . . . . . . 84
6.1.6. Recovery from Replayed Requests . . . . . . . . . . 85
6.1.7. Releasing lock_owner State . . . . . . . . . . . . . 85
6.1.8. Use of Open Confirmation . . . . . . . . . . . . . . 85
6.2. Lock Ranges . . . . . . . . . . . . . . . . . . . . . . 87
6.3. Upgrading and Downgrading Locks . . . . . . . . . . . . 87
6.4. Blocking Locks . . . . . . . . . . . . . . . . . . . . . 87
6.5. Lease Renewal . . . . . . . . . . . . . . . . . . . . . 88
6.6. Crash Recovery . . . . . . . . . . . . . . . . . . . . . 89
6.6.1. Client Failure and Recovery . . . . . . . . . . . . 89
6.6.2. Server Failure and Recovery . . . . . . . . . . . . 90
6.6.3. Network Partitions and Recovery . . . . . . . . . . 92
6.7. Recovery from a Lock Request Timeout or Abort . . . . . 95
6.8. Server Revocation of Locks . . . . . . . . . . . . . . . 96
6.9. Share Reservations . . . . . . . . . . . . . . . . . . . 97
6.10. OPEN/CLOSE Operations . . . . . . . . . . . . . . . . . 97
6.10.1. Close and Retention of State Information . . . . . . 98
6.11. Open Upgrade and Downgrade . . . . . . . . . . . . . . . 99
6.12. Short and Long Leases . . . . . . . . . . . . . . . . . 99
6.13. Clocks, Propagation Delay, and Calculating Lease
Expiration . . . . . . . . . . . . . . . . . . . . . . . 100
6.14. Migration, Replication and State . . . . . . . . . . . . 100
6.14.1. Migration and State . . . . . . . . . . . . . . . . 101
6.14.2. Replication and State . . . . . . . . . . . . . . . 102
6.14.3. Notification of Migrated Lease . . . . . . . . . . . 102
6.14.4. Migration and the Lease_time Attribute . . . . . . . 103
7. Client-Side Caching . . . . . . . . . . . . . . . . . . . . . 103
7.1. Performance Challenges for Client-Side Caching . . . . . 104
7.2. Delegation and Callbacks . . . . . . . . . . . . . . . . 105
7.2.1. Delegation Recovery . . . . . . . . . . . . . . . . 106
7.3. Data Caching . . . . . . . . . . . . . . . . . . . . . . 108
7.3.1. Data Caching and OPENs . . . . . . . . . . . . . . . 108
7.3.2. Data Caching and File Locking . . . . . . . . . . . 109
7.3.3. Data Caching and Mandatory File Locking . . . . . . 111
7.3.4. Data Caching and File Identity . . . . . . . . . . . 111
7.4. Open Delegation . . . . . . . . . . . . . . . . . . . . 112
7.4.1. Open Delegation and Data Caching . . . . . . . . . . 115
7.4.2. Open Delegation and File Locks . . . . . . . . . . . 116
7.4.3. Handling of CB_GETATTR . . . . . . . . . . . . . . . 116
7.4.4. Recall of Open Delegation . . . . . . . . . . . . . 119
7.4.5. Clients that Fail to Honor Delegation Recalls . . . 121
7.4.6. Delegation Revocation . . . . . . . . . . . . . . . 122
7.5. Data Caching and Revocation . . . . . . . . . . . . . . 122
7.5.1. Revocation Recovery for Write Open Delegation . . . 123
7.6. Attribute Caching . . . . . . . . . . . . . . . . . . . 124
7.7. Data and Metadata Caching and Memory Mapped Files . . . 126
7.8. Name Caching . . . . . . . . . . . . . . . . . . . . . . 128
7.9. Directory Caching . . . . . . . . . . . . . . . . . . . 129
8. Security Negotiation . . . . . . . . . . . . . . . . . . . . 130
9. Clarification of Security Negotiation in NFSv4.1 . . . . . . 130
9.1. PUTFH + LOOKUP . . . . . . . . . . . . . . . . . . . . . 130
9.2. PUTFH + LOOKUPP . . . . . . . . . . . . . . . . . . . . 131
9.3. PUTFH + SECINFO . . . . . . . . . . . . . . . . . . . . 131
9.4. PUTFH + Anything Else . . . . . . . . . . . . . . . . . 131
10. NFSv4.1 Sessions . . . . . . . . . . . . . . . . . . . . . . 132
10.1. Sessions Background . . . . . . . . . . . . . . . . . . 132
10.1.1. Introduction to Sessions . . . . . . . . . . . . . . 132
10.1.2. Motivation . . . . . . . . . . . . . . . . . . . . . 133
10.1.3. Problem Statement . . . . . . . . . . . . . . . . . 134
10.1.4. NFSv4 Session Extension Characteristics . . . . . . 136
10.2. Transport Issues . . . . . . . . . . . . . . . . . . . . 136
10.2.1. Session Model . . . . . . . . . . . . . . . . . . . 136
10.2.2. Connection State . . . . . . . . . . . . . . . . . . 137
10.2.3. NFSv4 Channels, Sessions and Connections . . . . . . 138
10.2.4. Reconnection, Trunking and Failover . . . . . . . . 140
10.2.5. Server Duplicate Request Cache . . . . . . . . . . . 141
10.3. Session Initialization and Transfer Models . . . . . . . 142
10.3.1. Session Negotiation . . . . . . . . . . . . . . . . 142
10.3.2. RDMA Requirements . . . . . . . . . . . . . . . . . 144
10.3.3. RDMA Connection Resources . . . . . . . . . . . . . 144
10.3.4. TCP and RDMA Inline Transfer Model . . . . . . . . . 145
10.3.5. RDMA Direct Transfer Model . . . . . . . . . . . . . 148
10.4. Connection Models . . . . . . . . . . . . . . . . . . . 151
10.4.1. TCP Connection Model . . . . . . . . . . . . . . . . 152
10.4.2. Negotiated RDMA Connection Model . . . . . . . . . . 153
10.4.3. Automatic RDMA Connection Model . . . . . . . . . . 154
10.5. Buffer Management, Transfer, Flow Control . . . . . . . 154
10.6. Retry and Replay . . . . . . . . . . . . . . . . . . . . 157
10.7. The Back Channel . . . . . . . . . . . . . . . . . . . . 158
10.8. COMPOUND Sizing Issues . . . . . . . . . . . . . . . . . 159
10.9. Data Alignment . . . . . . . . . . . . . . . . . . . . . 159
10.10. NFSv4 Integration . . . . . . . . . . . . . . . . . . . 161
10.10.1. Minor Versioning . . . . . . . . . . . . . . . . . . 161
10.10.2. Slot Identifiers and Server Duplicate Request
Cache . . . . . . . . . . . . . . . . . . . . . . . 161
10.10.3. Resolving server callback races with sessions . . . 165
10.10.4. COMPOUND and CB_COMPOUND . . . . . . . . . . . . . . 166
10.10.5. eXternal Data Representation Efficiency . . . . . . 167
10.10.6. Effect of Sessions on Existing Operations . . . . . 167
10.10.7. Authentication Efficiencies . . . . . . . . . . . . 168
10.11. Sessions Security Considerations . . . . . . . . . . . . 169
10.11.1. Authentication . . . . . . . . . . . . . . . . . . . 171
11. Directory Delegations . . . . . . . . . . . . . . . . . . . . 172
11.1. Introduction to Directory Delegations . . . . . . . . . 172
11.2. Directory Delegation Design (in brief) . . . . . . . . . 173
11.3. Recommended Attributes in support of Directory
Delegations . . . . . . . . . . . . . . . . . . . . . . 174
11.4. Delegation Recall . . . . . . . . . . . . . . . . . . . 175
11.5. Delegation Recovery . . . . . . . . . . . . . . . . . . 175
12. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 175
13. General Definitions . . . . . . . . . . . . . . . . . . . . . 178
13.1. Metadata Server . . . . . . . . . . . . . . . . . . . . 178
13.2. Client . . . . . . . . . . . . . . . . . . . . . . . . . 178
13.3. Storage Device . . . . . . . . . . . . . . . . . . . . . 178
13.4. Storage Protocol . . . . . . . . . . . . . . . . . . . . 179
13.5. Control Protocol . . . . . . . . . . . . . . . . . . . . 179
13.6. Metadata . . . . . . . . . . . . . . . . . . . . . . . . 179
13.7. Layout . . . . . . . . . . . . . . . . . . . . . . . . . 180
14. pNFS protocol semantics . . . . . . . . . . . . . . . . . . . 180
14.1. Definitions . . . . . . . . . . . . . . . . . . . . . . 180
14.1.1. Layout Types . . . . . . . . . . . . . . . . . . . . 180
14.1.2. Layout Iomode . . . . . . . . . . . . . . . . . . . 181
14.1.3. Layout Segments . . . . . . . . . . . . . . . . . . 181
14.1.4. Device IDs . . . . . . . . . . . . . . . . . . . . . 182
14.1.5. Aggregation Schemes . . . . . . . . . . . . . . . . 183
14.2. Guarantees Provided by Layouts . . . . . . . . . . . . . 183
14.3. Getting a Layout . . . . . . . . . . . . . . . . . . . . 184
14.4. Committing a Layout . . . . . . . . . . . . . . . . . . 185
14.4.1. LAYOUTCOMMIT and mtime/atime/change . . . . . . . . 186
14.4.2. LAYOUTCOMMIT and size . . . . . . . . . . . . . . . 186
14.4.3. LAYOUTCOMMIT and layoutupdate . . . . . . . . . . . 187
14.5. Recalling a Layout . . . . . . . . . . . . . . . . . . . 187
14.5.1. Basic Operation . . . . . . . . . . . . . . . . . . 188
14.5.2. Recall Callback Robustness . . . . . . . . . . . . . 189
14.5.3. Recall/Return Sequencing . . . . . . . . . . . . . . 190
14.6. Metadata Server Write Propagation . . . . . . . . . . . 192
14.7. Crash Recovery . . . . . . . . . . . . . . . . . . . . . 193
14.7.1. Leases . . . . . . . . . . . . . . . . . . . . . . . 193
14.7.2. Client Recovery . . . . . . . . . . . . . . . . . . 194
14.7.3. Metadata Server Recovery . . . . . . . . . . . . . . 195
14.7.4. Storage Device Recovery . . . . . . . . . . . . . . 197
15. Security Considerations . . . . . . . . . . . . . . . . . . . 198
15.1. File Layout Security . . . . . . . . . . . . . . . . . . 199
15.2. Object Layout Security . . . . . . . . . . . . . . . . . 199
15.3. Block/Volume Layout Security . . . . . . . . . . . . . . 201
16. The NFSv4 File Layout Type . . . . . . . . . . . . . . . . . 201
16.1. File Striping and Data Access . . . . . . . . . . . . . 202
16.1.1. Sparse and Dense Storage Device Data Layouts . . . . 203
16.1.2. Metadata and Storage Device Roles . . . . . . . . . 205
16.1.3. Device Multipathing . . . . . . . . . . . . . . . . 206
16.1.4. Operations Issued to Storage Devices . . . . . . . . 206
16.2. Global Stateid Requirements . . . . . . . . . . . . . . 207
16.3. The Layout Iomode . . . . . . . . . . . . . . . . . . . 207
16.4. Storage Device State Propagation . . . . . . . . . . . . 208
16.4.1. Lock State Propagation . . . . . . . . . . . . . . . 208
16.4.2. Open-mode Validation . . . . . . . . . . . . . . . . 209
16.4.3. File Attributes . . . . . . . . . . . . . . . . . . 209
16.5. Storage Device Component File Size . . . . . . . . . . . 210
16.6. Crash Recovery Considerations . . . . . . . . . . . . . 211
16.7. Security Considerations . . . . . . . . . . . . . . . . 211
16.8. Alternate Approaches . . . . . . . . . . . . . . . . . . 211
17. Layouts and Aggregation . . . . . . . . . . . . . . . . . . . 212
17.1. Simple Map . . . . . . . . . . . . . . . . . . . . . . . 213
17.2. Block Extent Map . . . . . . . . . . . . . . . . . . . . 213
17.3. Striped Map (RAID 0) . . . . . . . . . . . . . . . . . . 213
17.4. Replicated Map . . . . . . . . . . . . . . . . . . . . . 213
17.5. Concatenated Map . . . . . . . . . . . . . . . . . . . . 214
17.6. Nested Map . . . . . . . . . . . . . . . . . . . . . . . 214
18. Minor Versioning . . . . . . . . . . . . . . . . . . . . . . 214
19. Internationalization . . . . . . . . . . . . . . . . . . . . 216
19.1. Stringprep profile for the utf8str_cs type . . . . . . . 218
19.2. Stringprep profile for the utf8str_cis type . . . . . . 219
19.3. Stringprep profile for the utf8str_mixed type . . . . . 221
19.4. UTF-8 Related Errors . . . . . . . . . . . . . . . . . . 222
20. Error Definitions . . . . . . . . . . . . . . . . . . . . . . 222
21. NFS version 4.1 Procedures . . . . . . . . . . . . . . . . . 231
21.1. Procedure 0: NULL - No Operation . . . . . . . . . . . . 231
21.2. Procedure 1: COMPOUND - Compound Operations . . . . . . 232
22. NFS version 4.1 Operations . . . . . . . . . . . . . . . . . 234
22.1. Operation 3: ACCESS - Check Access Rights . . . . . . . 235
22.2. Operation 4: CLOSE - Close File . . . . . . . . . . . . 237
22.3. Operation 5: COMMIT - Commit Cached Data . . . . . . . . 238
22.4. Operation 6: CREATE - Create a Non-Regular File Object . 241
22.5. Operation 7: DELEGPURGE - Purge Delegations Awaiting
Recovery . . . . . . . . . . . . . . . . . . . . . . . . 244
22.6. Operation 8: DELEGRETURN - Return Delegation . . . . . . 245
22.7. Operation 9: GETATTR - Get Attributes . . . . . . . . . 245
22.8. Operation 10: GETFH - Get Current Filehandle . . . . . . 247
22.9. Operation 11: LINK - Create Link to a File . . . . . . . 248
22.10. Operation 12: LOCK - Create Lock . . . . . . . . . . . . 249
22.11. Operation 13: LOCKT - Test For Lock . . . . . . . . . . 253
22.12. Operation 14: LOCKU - Unlock File . . . . . . . . . . . 255
22.13. Operation 15: LOOKUP - Lookup Filename . . . . . . . . . 256
22.14. Operation 16: LOOKUPP - Lookup Parent Directory . . . . 258
22.15. Operation 17: NVERIFY - Verify Difference in
Attributes . . . . . . . . . . . . . . . . . . . . . . . 259
22.16. Operation 18: OPEN - Open a Regular File . . . . . . . . 260
22.17. Operation 19: OPENATTR - Open Named Attribute
Directory . . . . . . . . . . . . . . . . . . . . . . . 269
22.18. Operation 20: OPEN_CONFIRM - Confirm Open . . . . . . . 271
22.19. Operation 21: OPEN_DOWNGRADE - Reduce Open File Access . 273
22.20. Operation 22: PUTFH - Set Current Filehandle . . . . . . 274
22.21. Operation 24: PUTROOTFH - Set Root Filehandle . . . . . 275
22.22. Operation 25: READ - Read from File . . . . . . . . . . 276
22.23. Operation 26: READDIR - Read Directory . . . . . . . . . 278
22.24. Operation 27: READLINK - Read Symbolic Link . . . . . . 282
22.25. Operation 28: REMOVE - Remove Filesystem Object . . . . 283
22.26. Operation 29: RENAME - Rename Directory Entry . . . . . 285
22.27. Operation 30: RENEW - Renew a Lease . . . . . . . . . . 287
22.28. Operation 31: RESTOREFH - Restore Saved Filehandle . . . 288
22.29. Operation 32: SAVEFH - Save Current Filehandle . . . . . 289
22.30. Operation 33: SECINFO - Obtain Available Security . . . 290
22.31. Operation 34: SETATTR - Set Attributes . . . . . . . . . 293
22.32. Operation 35: SETCLIENTID - Negotiate Clientid . . . . . 296
22.33. Operation 36: SETCLIENTID_CONFIRM - Confirm Clientid . . 300
22.34. Operation 37: VERIFY - Verify Same Attributes . . . . . 303
22.35. Operation 38: WRITE - Write to File . . . . . . . . . . 304
22.36. Operation 39: RELEASE_LOCKOWNER - Release Lockowner
State . . . . . . . . . . . . . . . . . . . . . . . . . 309
22.37. Operation 10044: ILLEGAL - Illegal operation . . . . . . 310
22.38. SECINFO_NO_NAME - Get Security on Unnamed Object . . . . 310
22.39. CREATECLIENTID - Instantiate Clientid . . . . . . . . . 312
22.40. CREATESESSION - Create New Session and Confirm
Clientid . . . . . . . . . . . . . . . . . . . . . . . . 317
22.41. BIND_BACKCHANNEL - Create a callback channel binding . . 322
22.42. DESTROYSESSION - Destroy existing session . . . . . . . 324
22.43. SEQUENCE - Supply per-procedure sequencing and control . 325
22.44. GET_DIR_DELEGATION - Get a directory delegation . . . . 326
22.45. LAYOUTGET - Get Layout Information . . . . . . . . . . . 330
22.46. LAYOUTCOMMIT - Commit writes made using a layout . . . . 332
22.47. LAYOUTRETURN - Release Layout Information . . . . . . . 336
22.48. GETDEVICEINFO - Get Device Information . . . . . . . . . 337
22.49. GETDEVICELIST . . . . . . . . . . . . . . . . . . . . . 338
23. NFS version 4.1 Callback Procedures . . . . . . . . . . . . . 340
23.1. Procedure 0: CB_NULL - No Operation . . . . . . . . . . 340
23.2. Procedure 1: CB_COMPOUND - Compound Operations . . . . . 340
24. NFS version 4.1 Callback Operations . . . . . . . . . . . . . 342
24.1. Operation 3: CB_GETATTR - Get Attributes . . . . . . . . 342
24.2. Operation 4: CB_RECALL - Recall an Open Delegation . . . 343
24.3. Operation 10044: CB_ILLEGAL - Illegal Callback
Operation . . . . . . . . . . . . . . . . . . . . . . . 344
24.4. CB_RECALLCREDIT - change flow control limits . . . . . . 345
24.5. CB_SEQUENCE - Supply callback channel sequencing and
control . . . . . . . . . . . . . . . . . . . . . . . . 346
24.6. CB_NOTIFY - Notify directory changes . . . . . . . . . . 348
24.7. CB_RECALL_ANY - Keep any N delegations . . . . . . . . . 351
24.8. CB_SIZECHANGED . . . . . . . . . . . . . . . . . . . . . 354
24.9. CB_LAYOUTRECALL . . . . . . . . . . . . . . . . . . . . 355
25. References . . . . . . . . . . . . . . . . . . . . . . . . . 357
25.1. Normative References . . . . . . . . . . . . . . . . . . 357
25.2. Informative References . . . . . . . . . . . . . . . . . 357
Appendix A. Acknowledgments . . . . . . . . . . . . . . . . . . 358
Author's Address . . . . . . . . . . . . . . . . . . . . . . . . 358
Intellectual Property and Copyright Statements . . . . . . . . . 360
1. Protocol Data Types
The syntax and semantics to describe the data types of the NFS
version 4 protocol are defined in the XDR RFC1832 [2] and RPC RFC1831
[3] documents. The next sections build upon the XDR data types to
define types and structures specific to this protocol.
1.1. Basic Data Types
These are the base NFSv4 data types.
+---------------+---------------------------------------------------+
| Data Type | Definition |
+---------------+---------------------------------------------------+
| int32_t | typedef int int32_t; |
| uint32_t | typedef unsigned int uint32_t; |
| int64_t | typedef hyper int64_t; |
| uint64_t | typedef unsigned hyper uint64_t; |
| attrlist4 | typedef opaque attrlist4<>; |
| | Used for file/directory attributes |
| bitmap4 | typedef uint32_t bitmap4<>; |
| | Used in attribute array encoding. |
| changeid4 | typedef uint64_t changeid4; |
| | Used in definition of change_info |
| clientid4 | typedef uint64_t clientid4; |
| | Shorthand reference to client identification |
| component4 | typedef utf8str_cs component4; |
| | Represents path name components |
| count4 | typedef uint32_t count4; |
| | Various count parameters (READ, WRITE, COMMIT) |
| length4 | typedef uint64_t length4; |
| | Describes LOCK lengths |
| linktext4 | typedef utf8str_cs linktext4; |
| | Symbolic link contents |
| mode4 | typedef uint32_t mode4; |
| | Mode attribute data type |
| nfs_cookie4 | typedef uint64_t nfs_cookie4; |
| | Opaque cookie value for READDIR |
| nfs_fh4 | typedef opaque nfs_fh4<NFS4_FHSIZE>; |
| | Filehandle definition; NFS4_FHSIZE is defined as |
| | 128 |
| nfs_ftype4 | enum nfs_ftype4; |
| | Various defined file types |
| nfsstat4 | enum nfsstat4; |
| | Return value for operations |
| offset4 | typedef uint64_t offset4; |
| | Various offset designations (READ, WRITE, LOCK, |
| | COMMIT) |
| pathname4 | typedef component4 pathname4<>; |
| | Represents path name for fs_locations |
| qop4 | typedef uint32_t qop4; |
| | Quality of protection designation in SECINFO |
| sec_oid4 | typedef opaque sec_oid4<>; |
| | Security Object Identifier. The sec_oid4 data |
| | type is not really opaque; instead, it contains |
| | an ASN.1 OBJECT IDENTIFIER as used by GSS-API in |
| | the mech_type argument to GSS_Init_sec_context. |
| | See RFC2743 [4] for details. |
| seqid4 | typedef uint32_t seqid4; |
| | Sequence identifier used for file locking |
| utf8string | typedef opaque utf8string<>; |
| | UTF-8 encoding for strings |
| utf8str_cis | typedef opaque utf8str_cis; |
| | Case-insensitive UTF-8 string |
| utf8str_cs | typedef opaque utf8str_cs; |
| | Case-sensitive UTF-8 string |
| utf8str_mixed | typedef opaque utf8str_mixed; |
| | UTF-8 strings with a case sensitive prefix and a |
| | case insensitive suffix. |
| verifier4 | typedef opaque verifier4[NFS4_VERIFIER_SIZE]; |
| | Verifier used for various operations (COMMIT, |
| | CREATE, OPEN, READDIR, SETCLIENTID, |
| | SETCLIENTID_CONFIRM, WRITE) NFS4_VERIFIER_SIZE is |
| | defined as 8. |
+---------------+---------------------------------------------------+
End of Base Data Types
Table 1
1.2. Structured Data Types
1.2.1. nfstime4
struct nfstime4 {
int64_t seconds;
uint32_t nseconds;
}
The nfstime4 structure gives the number of seconds and nanoseconds
since midnight or 0 hour January 1, 1970 Coordinated Universal Time
(UTC). Values greater than zero for the seconds field denote dates
after the 0 hour January 1, 1970. Values less than zero for the
seconds field denote dates before the 0 hour January 1, 1970. In
both cases, the nseconds field is to be added to the seconds field
for the final time representation. For example, if the time to be
represented is one-half second before 0 hour January 1, 1970, the
seconds field would have a value of negative one (-1) and the
nseconds fields would have a value of one-half second (500000000).
Values greater than 999,999,999 for nseconds are considered invalid.
This data type is used to pass time and date information. A server
converts to and from its local representation of time when processing
time values, preserving as much accuracy as possible. If the
precision of timestamps stored for a filesystem object is less than
that defined here, loss of precision can occur. An adjunct time
maintenance
protocol is recommended to reduce client and server time skew.
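As an illustrative aid (not part of the protocol; the function name is invented), the seconds/nseconds arithmetic described above can be sketched in Python:

```python
def nfstime4_to_epoch(seconds, nseconds):
    """Combine an nfstime4 pair into a single epoch time in seconds.

    Per the text, nseconds is added to seconds for the final time
    representation, for dates both before and after the 1970 epoch,
    and nseconds values greater than 999,999,999 are invalid.
    """
    if not 0 <= nseconds <= 999_999_999:
        raise ValueError("invalid nseconds value")
    return seconds + nseconds / 1_000_000_000

# One-half second before 0 hour January 1, 1970:
t = nfstime4_to_epoch(-1, 500_000_000)   # -0.5
```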
1.2.2. time_how4
enum time_how4 {
SET_TO_SERVER_TIME4 = 0,
SET_TO_CLIENT_TIME4 = 1
};
1.2.3. settime4
union settime4 switch (time_how4 set_it) {
case SET_TO_CLIENT_TIME4:
nfstime4 time;
default:
void;
};
The above definitions are used as the attribute definitions to set
time values. If set_it is SET_TO_SERVER_TIME4, then the server uses
its local representation of time for the time value.
1.2.4. specdata4
struct specdata4 {
uint32_t specdata1; /* major device number */
uint32_t specdata2; /* minor device number */
};
This data type represents additional information for the device file
types NF4CHR and NF4BLK.
1.2.5. fsid4
struct fsid4 {
uint64_t major;
uint64_t minor;
};
1.2.6. fs_location4
struct fs_location4 {
utf8str_cis server<>;
pathname4 rootpath;
};
1.2.7. fs_locations4
struct fs_locations4 {
pathname4 fs_root;
fs_location4 locations<>;
};
The fs_location4 and fs_locations4 data types are used for the
fs_locations recommended attribute which is used for migration and
replication support.
1.2.8. fattr4
struct fattr4 {
bitmap4 attrmask;
attrlist4 attr_vals;
};
The fattr4 structure is used to represent file and directory
attributes.
The bitmap is a counted array of 32-bit integers used to contain bit
values. The position of the integer in the array that contains bit n
can be computed from the expression (n / 32), and its bit within that
integer is (n mod 32).
0 1
+-----------+-----------+-----------+--
| count | 31 .. 0 | 63 .. 32 |
+-----------+-----------+-----------+--
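The (n / 32) and (n mod 32) arithmetic can be sketched as follows; this is an illustrative Python fragment, not protocol XDR, and the helper names are invented:

```python
def bitmap4_set(words, n):
    """Set attribute bit n in a bitmap4 counted array of 32-bit words."""
    idx = n // 32          # which uint32 in the array holds bit n
    bit = n % 32           # bit position within that word
    while len(words) <= idx:
        words.append(0)    # grow the counted array as needed
    words[idx] |= 1 << bit
    return words

def bitmap4_test(words, n):
    """Return True if attribute bit n is set in the counted array."""
    idx, bit = n // 32, n % 32
    return idx < len(words) and bool(words[idx] & (1 << bit))

# Attribute 33 lands in the second word (33 // 32 == 1), bit 1.
words = bitmap4_set([], 33)
```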
1.2.9. change_info4
struct change_info4 {
bool atomic;
changeid4 before;
changeid4 after;
};
This structure is used with the CREATE, LINK, REMOVE, RENAME
operations to let the client know the value of the change attribute
for the directory in which the target filesystem object resides.
1.2.10. clientaddr4
struct clientaddr4 {
/* see struct rpcb in RFC1833 */
string r_netid<>; /* network id */
string r_addr<>; /* universal address */
};
The clientaddr4 structure is used as part of the SETCLIENTID
operation to either specify the address of the client that is using a
clientid or as part of the callback registration. The r_netid and
r_addr fields are specified in RFC1833 [9], but they are
underspecified as far as what they should look like for specific
protocols.
For TCP over IPv4 and for UDP over IPv4, the format of r_addr is the
US-ASCII string:
h1.h2.h3.h4.p1.p2
The prefix, "h1.h2.h3.h4", is the standard textual form for
representing an IPv4 address, which is always four octets long.
Assuming big-endian ordering, h1, h2, h3, and h4 are, respectively,
the first through fourth octets of the address, each converted to
ASCII-decimal. Assuming big-endian ordering, p1 and p2 are,
respectively, the first and second octets of the port, each
converted to ASCII-decimal. For example, if a host, in big-endian
order, has an address of 0x0A010307 and there is a service listening
on, in big-endian order, port 0x020F (decimal 527), then the
complete universal address is "10.1.3.7.2.15".
For TCP over IPv4 the value of r_netid is the string "tcp". For UDP
over IPv4 the value of r_netid is the string "udp".
For TCP over IPv6 and for UDP over IPv6, the format of r_addr is the
US-ASCII string:
x1:x2:x3:x4:x5:x6:x7:x8.p1.p2
The suffix "p1.p2" is the service port, and is computed the same way
as with universal addresses for TCP and UDP over IPv4. The prefix,
"x1:x2:x3:x4:x5:x6:x7:x8", is the standard textual form for
representing an IPv6 address as defined in Section 2.2 of RFC1884
[5]. Additionally, the two alternative forms specified in Section
2.2 of RFC1884 [5] are also acceptable.
For TCP over IPv6 the value of r_netid is the string "tcp6". For UDP
over IPv6 the value of r_netid is the string "udp6".
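The worked IPv4 example above can be reproduced with a short sketch (illustrative only; the helper name is invented):

```python
def universal_addr_ipv4(dotted_quad, port):
    """Format an r_addr universal address for TCP or UDP over IPv4.

    The prefix is the standard dotted-quad text form of the address;
    p1 and p2 are the high and low octets of the big-endian port,
    each converted to ASCII-decimal.
    """
    p1, p2 = (port >> 8) & 0xFF, port & 0xFF
    return "%s.%d.%d" % (dotted_quad, p1, p2)

# Host 0x0A010307 (10.1.3.7) with a service on port 0x020F (527):
addr = universal_addr_ipv4("10.1.3.7", 0x020F)   # "10.1.3.7.2.15"
```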
1.2.11. cb_client4
struct cb_client4 {
unsigned int cb_program;
clientaddr4 cb_location;
};
This structure is used by the client to inform the server of its
callback address; it includes the program number and client address.
1.2.12. nfs_client_id4
struct nfs_client_id4 {
verifier4 verifier;
opaque id<NFS4_OPAQUE_LIMIT>;
};
This structure is part of the arguments to the SETCLIENTID operation.
NFS4_OPAQUE_LIMIT is defined as 1024.
1.2.13. open_owner4
struct open_owner4 {
clientid4 clientid;
opaque owner<NFS4_OPAQUE_LIMIT>;
};
This structure is used to identify the owner of open state.
NFS4_OPAQUE_LIMIT is defined as 1024.
1.2.14. lock_owner4
struct lock_owner4 {
clientid4 clientid;
opaque owner<NFS4_OPAQUE_LIMIT>;
};
This structure is used to identify the owner of file locking state.
NFS4_OPAQUE_LIMIT is defined as 1024.
1.2.15. open_to_lock_owner4
struct open_to_lock_owner4 {
seqid4 open_seqid;
stateid4 open_stateid;
seqid4 lock_seqid;
lock_owner4 lock_owner;
};
This structure is used for the first LOCK operation done for an
open_owner4. It provides both the open_stateid and lock_owner such
that the transition is made from a valid open_stateid sequence to
that of the new lock_stateid sequence. Using this mechanism avoids
the confirmation of the lock_owner/lock_seqid pair since it is tied
to established state in the form of the open_stateid/open_seqid.
1.2.16. stateid4
struct stateid4 {
uint32_t seqid;
opaque other[12];
};
This structure is used for the various state sharing mechanisms
between the client and server. For the client, this data structure
is read-only. The starting value of the seqid field is undefined.
The server is required to increment the seqid field monotonically at
each transition of the stateid. This is important since the client
will inspect the seqid in OPEN stateids to determine the order of
OPEN processing done by the server.
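As an illustrative sketch (not normative; the function name is invented, and seqid wraparound is ignored), a client comparing two stateids with the same "other" field can use the seqid to determine which the server issued later:

```python
def later_stateid(a, b):
    """Return whichever stateid the server issued later.

    a and b are (seqid, other) pairs for the same state; the server
    increments seqid monotonically at each stateid transition, so the
    larger seqid reflects later OPEN processing by the server.
    """
    if a[1] != b[1]:
        raise ValueError("stateids refer to different state")
    return a if a[0] > b[0] else b
```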
1.2.17. pnfs_layouttype4
enum pnfs_layouttype4 {
LAYOUT_NFSV4_FILES = 1,
LAYOUT_OSD2_OBJECTS = 2,
LAYOUT_BLOCK_VOLUME = 3
};
A layout type specifies the layout being used. The implication is
that clients have "layout drivers" that support one or more layout
types. The file server advertises the layout types it supports
through the LAYOUT_TYPES file system attribute. A client asks for
layouts of a particular type in LAYOUTGET, and passes those layouts
to its layout driver. The set of well-known layout types must be
defined. In addition, this document is to define a private range of
layout types, which would allow custom installations to introduce
new layout types.
[[Comment.1: Determine private range of layout types]]
New layout types must be specified in RFCs approved by the IESG
before becoming part of the pNFS specification.
The LAYOUT_NFSV4_FILES enumeration specifies that the NFSv4 file
layout type is to be used. The LAYOUT_OSD2_OBJECTS enumeration
specifies that the object layout, as defined in [10], is to be used.
Similarly, the LAYOUT_BLOCK_VOLUME enumeration specifies that the
block/volume layout, as defined in [11], is to be used.
1.2.18. pnfs_deviceid4
typedef uint32_t pnfs_deviceid4; /* 32-bit device ID */
Layout information includes device IDs that specify a storage device
through a compact handle. Addressing and type information is
obtained with the GETDEVICEINFO operation. A client must not assume
that device IDs are valid across metadata server reboots. The device
ID is qualified by the layout type and is unique per file system
(FSID). This allows different layout drivers to generate device IDs
without the need for coordination. See Section 14.1.4 for more
details.
1.2.19. pnfs_deviceaddr4
struct pnfs_netaddr4 {
string r_netid<>; /* network ID */
string r_addr<>; /* universal address */
};
struct pnfs_deviceaddr4 {
pnfs_layouttype4 type;
opaque device_addr<>;
};
The device address is used to set up a communication channel with the
storage device. Different layout types will require different types
of structures to define how they communicate with storage devices.
The opaque device_addr field must be interpreted based on the
specified layout type.
Currently, the only defined device address is that for the NFSv4 file
layout (struct pnfs_netaddr4), which identifies a storage device by
network IP address and port number. This is sufficient for the
clients to communicate with the NFSv4 storage devices, and may also
be sufficient for object-based storage drivers to communicate with
OSDs. The other device address we expect to support is a SCSI volume
identifier. The final protocol specification will detail the allowed
values for device_type and the format of their associated location
information.
[NOTE: other device addresses will be added as the respective
specifications mature. It has been suggested that a separate
device_type enumeration be used as a switch to the pnfs_deviceaddr4
structure (e.g., if multiple types of addresses exist for the same
layout type). Until such a time as a real case is made and the
respective layout types have matured, the device address structure
will be left as is.]
1.2.20. pnfs_devlist_item4
struct pnfs_devlist_item4 {
pnfs_deviceid4 id;
pnfs_deviceaddr4 addr;
};
An array of these values is returned by the GETDEVICELIST operation.
They define the set of devices associated with a file system.
1.2.21. pnfs_layout4
struct pnfs_layout4 {
offset4 offset;
length4 length;
pnfs_layoutiomode4 iomode;
pnfs_layouttype4 type;
opaque layout<>;
};
The pnfs_layout4 structure defines a layout for a file. The layout
type specific data is opaque within this structure and must be
interpreted based on the layout type. Currently, only the NFSv4
file layout type is defined; see Section 16.1 for its definition.
Since layouts are sub-dividable, the offset and length, together
with the file's filehandle, the clientid, iomode, and layout type,
identify the layout.
[[Comment.2: there is a discussion of moving the striping
information, or more generally the "aggregation scheme", up to the
generic layout level. This creates a two-layer system where the top
level is a switch on different data placement layouts, and the next
level down is a switch on different data storage types. This lets
different layouts (e.g., striping or mirroring or redundant servers)
to be layered over different storage devices. This would move
geometry information out of nfsv4_file_layouttype4 and up into a
generic pnfs_striped_layout type that would specify a set of
pnfs_deviceid4 and pnfs_devicetype4 to use for storage. Instead of
nfsv4_file_layouttype4, there would be pnfs_nfsv4_devicetype4.]]
1.2.22. pnfs_layoutupdate4
struct pnfs_layoutupdate4 {
pnfs_layouttype4 type;
opaque layoutupdate_data<>;
};
The pnfs_layoutupdate4 structure is used by the client to return
'updated' layout information to the metadata server at LAYOUTCOMMIT
time. This structure provides a channel to pass layout type specific
information back to the metadata server. For example, for
block/volume layout types this could include the list of reserved
blocks that were written. The contents of the opaque
layoutupdate_data argument are determined by the layout type and are
defined in their context. The NFSv4 file-based layout does not use
this structure, thus the layoutupdate_data field should have a zero
length.
1.2.23. pnfs_layouthint4
struct pnfs_layouthint4 {
pnfs_layouttype4 type;
opaque layouthint_data<>;
};
The pnfs_layouthint4 structure is used by the client to pass in a hint
about the type of layout it would like created for a particular file.
It is the structure specified by the FILE_LAYOUT_HINT attribute
described below. The metadata server may ignore the hint, or may
selectively ignore fields within the hint. This hint should be
provided at create time as part of the initial attributes within
OPEN. The NFSv4 file-based layout uses the "nfsv4_file_layouthint"
structure as defined in Section 16.1.
1.2.24. pnfs_layoutiomode4
enum pnfs_layoutiomode4 {
LAYOUTIOMODE_READ = 1,
LAYOUTIOMODE_RW = 2,
LAYOUTIOMODE_ANY = 3
};
The iomode specifies whether the client intends to read or write
(with the possibility of reading) the data represented by the layout.
The ANY iomode MUST NOT be used for LAYOUTGET, however, it can be
used for LAYOUTRETURN and LAYOUTRECALL. The ANY iomode specifies
that layouts pertaining to both READ and RW iomodes are being
returned or recalled, respectively. The metadata server's use of the
iomode may depend on the layout type being used. The storage devices
may validate I/O accesses against the iomode and reject invalid
accesses.
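A sketch of the validation a storage device might perform (illustrative only; function names are invented, and actual checking is layout-type dependent):

```python
LAYOUTIOMODE_READ, LAYOUTIOMODE_RW, LAYOUTIOMODE_ANY = 1, 2, 3

def io_permitted(iomode, is_write):
    """Reject I/O that exceeds the layout's iomode: writes need an RW
    layout, while reads are permitted under either READ or RW."""
    if is_write:
        return iomode == LAYOUTIOMODE_RW
    return iomode in (LAYOUTIOMODE_READ, LAYOUTIOMODE_RW)

def iomode_valid_for_layoutget(iomode):
    """ANY MUST NOT be used with LAYOUTGET; it is reserved for
    LAYOUTRETURN and LAYOUTRECALL."""
    return iomode in (LAYOUTIOMODE_READ, LAYOUTIOMODE_RW)
```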
1.2.25. nfs_impl_id4
struct nfs_impl_id4 {
utf8str_cis nii_domain;
utf8str_cs nii_name;
nfstime4 nii_date;
};
This structure is used to identify client and server implementation
detail. The nii_domain field is the DNS domain name that the
implementer is associated with. The nii_name field is the product
name of the implementation and is completely free form. It is
encouraged that the nii_name be used to distinguish machine
architecture, machine platforms, revisions, versions, and patch
levels. The nii_date field is the timestamp of when the software
instance was published or built.
1.2.26. impl_ident4
struct impl_ident4 {
clientid4 ii_clientid;
struct nfs_impl_id4 ii_impl_id;
};
This is used for exchanging implementation identification between
client and server.
2. Filehandles
The filehandle in the NFS protocol is a per-server unique identifier
for a filesystem object. The contents of the filehandle are opaque
to the client. Therefore, the server is responsible for translating
the filehandle to an internal representation of the filesystem
object.
2.1. Obtaining the First Filehandle
The operations of the NFS protocol are defined in terms of one or
more filehandles. Therefore, the client needs a filehandle to
initiate communication with the server. With the NFS version 2
protocol [RFC1094] and the NFS version 3 protocol [RFC1813], there
exists an ancillary protocol to obtain this first filehandle. The
MOUNT protocol, RPC program number 100005, provides the mechanism of
translating a string based filesystem path name to a filehandle which
can then be used by the NFS protocols.
The MOUNT protocol has deficiencies in the area of security and use
via firewalls. This is one reason that the use of the public
filehandle was introduced in [RFC2054] and [RFC2055]. With the use
of the public filehandle in combination with the LOOKUP operation in
the NFS version 2 and 3 protocols, it has been demonstrated that the
MOUNT protocol is unnecessary for viable interaction between NFS
client and server.
Therefore, the NFS version 4 protocol will not use an ancillary
protocol for translation from string based path names to a
filehandle. Two special filehandles will be used as starting points
for the NFS client.
2.1.1. Root Filehandle
The first of the special filehandles is the ROOT filehandle. The
ROOT filehandle is the "conceptual" root of the filesystem name space
at the NFS server. The client uses or starts with the ROOT
filehandle by employing the PUTROOTFH operation. The PUTROOTFH
operation instructs the server to set the "current" filehandle to the
ROOT of the server's file tree. Once this PUTROOTFH operation is
used, the client can then traverse the entirety of the server's file
tree with the LOOKUP operation. A complete discussion of the server
name space is in the section "NFS Server Name Space".
2.1.2. Public Filehandle
The second special filehandle is the PUBLIC filehandle. Unlike the
ROOT filehandle, the PUBLIC filehandle may be bound to or represent
an arbitrary filesystem object at the server. The server is
responsible
for this binding. It may be that the PUBLIC filehandle and the ROOT
filehandle refer to the same filesystem object. However, it is up to
the administrative software at the server and the policies of the
server administrator to define the binding of the PUBLIC filehandle
and server filesystem object. The client may not make any
assumptions about this binding. The client uses the PUBLIC
filehandle via the PUTPUBFH operation.
2.2. Filehandle Types
In the NFS version 2 and 3 protocols, there was one type of
filehandle with a single set of semantics. This type of filehandle
is termed "persistent" in NFS Version 4. The semantics of a
persistent filehandle remain the same as before. A new type of
filehandle introduced in NFS Version 4 is the "volatile" filehandle,
which attempts to accommodate certain server environments.
The volatile filehandle type was introduced to address server
functionality or implementation issues which make correct
implementation of a persistent filehandle infeasible. Some server
environments do not provide a filesystem level invariant that can be
used to construct a persistent filehandle. The underlying server
filesystem may not provide the invariant or the server's filesystem
programming interfaces may not provide access to the needed
invariant. Volatile filehandles may ease the implementation of
server functionality such as hierarchical storage management or
filesystem reorganization or migration. However, the volatile
filehandle increases the implementation burden for the client.
Since the client will need to handle persistent and volatile
filehandles differently, a file attribute is defined which may be
used by the client to determine the filehandle types being returned
by the server.
2.2.1. General Properties of a Filehandle
The filehandle contains all the information the server needs to
distinguish an individual file. To the client, the filehandle is
opaque. The client stores filehandles for use in a later request and
can compare two filehandles from the same server for equality by
doing a byte-by-byte comparison. However, the client MUST NOT
otherwise interpret the contents of filehandles. If two filehandles
from the same server are equal, they MUST refer to the same file.
Servers SHOULD try to maintain a one-to-one correspondence between
filehandles and files but this is not required. Clients MUST use
filehandle comparisons only to improve performance, not for correct
behavior. All clients need to be prepared for situations in which it
cannot be determined whether two filehandles denote the same object
and in such cases, avoid making invalid assumptions which might cause
incorrect behavior. Further discussion of filehandle and attribute
comparison in the context of data caching is presented in the section
"Data Caching and File Identity".
As an example, in the case that two different path names when
traversed at the server terminate at the same filesystem object, the
server SHOULD return the same filehandle for each path. This can
occur if a hard link is used to create two file names which refer to
the same underlying file object and associated data. For example, if
paths /a/b/c and /a/d/c refer to the same file, the server SHOULD
return the same filehandle for both path name traversals.
2.2.2. Persistent Filehandle
A persistent filehandle is defined as having a fixed value for the
lifetime of the filesystem object to which it refers. Once the
server creates the filehandle for a filesystem object, the server
MUST accept the same filehandle for the object for the lifetime of
the object. If the server restarts or reboots, the NFS server must
honor the same filehandle value as it did in the server's previous
instantiation. Similarly, if the filesystem is migrated, the new NFS
server must honor the same filehandle as the old NFS server.
The persistent filehandle will become stale or invalid when the
filesystem object is removed. When the server is presented with a
persistent filehandle that refers to a deleted object, it MUST return
an error of NFS4ERR_STALE. A filehandle may become stale when the
filesystem containing the object is no longer available. The
filesystem may become unavailable if it exists on removable media
and the media is no longer available at the server, if the
filesystem as a whole has been destroyed, or if the filesystem has
simply been removed from the server's name space (i.e., unmounted in
a UNIX environment).
2.2.3. Volatile Filehandle
A volatile filehandle does not share the same longevity
characteristics of a persistent filehandle. The server may determine
that a volatile filehandle is no longer valid at many different
points in time. If the server can definitively determine that a
volatile filehandle refers to an object that has been removed, the
server should return NFS4ERR_STALE to the client (as is the case for
persistent filehandles). In all other cases where the server
determines that a volatile filehandle can no longer be used, it
should return an error of NFS4ERR_FHEXPIRED.
The mandatory attribute "fh_expire_type" is used by the client to
determine what type of filehandle the server is providing for a
particular filesystem. This attribute is a bitmask with the
following values:
FH4_PERSISTENT The value of FH4_PERSISTENT is used to indicate a
persistent filehandle, which is valid until the object is removed
from the filesystem. The server will not return NFS4ERR_FHEXPIRED
for this filehandle. FH4_PERSISTENT is defined as a value in
which none of the bits specified below are set.
FH4_VOLATILE_ANY The filehandle may expire at any time, except as
specifically excluded (i.e., FH4_NOEXPIRE_WITH_OPEN).
FH4_NOEXPIRE_WITH_OPEN May only be set when FH4_VOLATILE_ANY is set.
If this bit is set, then the meaning of FH4_VOLATILE_ANY is
qualified to exclude any expiration of the filehandle when it is
open.
FH4_VOL_MIGRATION The filehandle will expire as a result of
migration. If FH4_VOLATILE_ANY is set, FH4_VOL_MIGRATION is
redundant.
FH4_VOL_RENAME The filehandle will expire during rename. This
includes a rename by the requesting client or a rename by any
other client. If FH4_VOLATILE_ANY is set, FH4_VOL_RENAME is
redundant.
Servers which provide volatile filehandles that may expire while open
(i.e. if FH4_VOL_MIGRATION or FH4_VOL_RENAME is set or if
FH4_VOLATILE_ANY is set and FH4_NOEXPIRE_WITH_OPEN not set), should
deny a RENAME or REMOVE that would affect an OPEN file of any of the
components leading to the OPEN file. In addition, the server should
deny all RENAME or REMOVE requests during the grace period upon
server restart.
Note that the bits FH4_VOL_MIGRATION and FH4_VOL_RENAME allow the
client to determine that expiration has occurred whenever a specific
event occurs, without an explicit filehandle expiration error from
the server. FH4_VOLATILE_ANY does not provide this form of
information.
In situations where the server will expire many, but not all
filehandles upon migration (e.g. all but those that are open),
FH4_VOLATILE_ANY (in this case with FH4_NOEXPIRE_WITH_OPEN) is a
better choice since the client may not assume that all filehandles
will expire when migration occurs, and it is likely that additional
expirations will occur (as a result of file CLOSE) that are separated
in time from the migration event itself.
2.3. One Method of Constructing a Volatile Filehandle
A volatile filehandle, while opaque to the client, could contain:
[volatile bit = 1 | server boot time | slot | generation number]
o slot is an index in the server volatile filehandle table
o generation number is the generation number for the table entry/
slot
When the client presents a volatile filehandle, the server makes the
following checks, which assume that the check for the volatile bit
has passed. If the embedded server boot time is less than the
current server boot time, return NFS4ERR_FHEXPIRED. If the slot is
out of range, return NFS4ERR_BADHANDLE. If the generation number
does not match, return NFS4ERR_FHEXPIRED.
When the server reboots, the table is gone (it is volatile).
If the volatile bit is 0, then the filehandle is a persistent
filehandle with a different structure following the bit.
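The construction and checks above can be sketched in Python (illustrative only; the packed layout, field widths, and function names are invented for this example):

```python
import struct

def make_volatile_fh(boot_time, slot, generation):
    """Pack the illustrative volatile filehandle layout from the text:
    [volatile bit = 1 | server boot time | slot | generation number]."""
    return struct.pack(">BIII", 1, boot_time, slot, generation)

def check_volatile_fh(fh, current_boot_time, table):
    """Re-run the server-side checks described above; 'table' maps
    slot -> current generation number for the volatile FH table."""
    volatile, boot, slot, gen = struct.unpack(">BIII", fh)
    if not volatile:
        return "PERSISTENT"            # different structure follows
    if boot < current_boot_time:
        return "NFS4ERR_FHEXPIRED"     # issued by a prior boot instance
    if slot >= len(table):
        return "NFS4ERR_BADHANDLE"     # slot out of range
    if table[slot] != gen:
        return "NFS4ERR_FHEXPIRED"     # table entry has been reused
    return "OK"
```

When the server reboots, `table` is simply re-created empty, so every outstanding volatile filehandle fails the boot-time check.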
2.4. Client Recovery from Filehandle Expiration
If possible, the client SHOULD recover from the receipt of an
NFS4ERR_FHEXPIRED error. The client must take on additional
responsibility so that it may prepare itself to recover from the
expiration of a volatile filehandle. If the server returns
persistent filehandles, the client does not need these additional
steps.
For volatile filehandles, most commonly the client will need to store
the component names leading up to and including the filesystem object
in question. With these names, the client should be able to recover
by finding a filehandle in the name space that is still available or
by starting at the root of the server's filesystem name space.
If the expired filehandle refers to an object that has been removed
from the filesystem, obviously the client will not be able to recover
from the expired filehandle.
It is also possible that the expired filehandle refers to a file that
has been renamed. If the file was renamed by another client, again
it is possible that the original client will not be able to recover.
However, in the case that the client itself is renaming the file and
the file is open, it is possible that the client may be able to
recover. The client can determine the new path name based on the
processing of the rename request. The client can then regenerate the
new filehandle based on the new path name. The client could also use
the compound operation mechanism to construct a set of operations
like:
RENAME A B
LOOKUP B
GETFH
Note that the COMPOUND procedure does not provide atomicity. This
example only reduces the overhead of recovering from an expired
filehandle.
3. File Attributes
To meet the requirements of extensibility and increased
interoperability with non-UNIX platforms, attributes must be handled
in a flexible manner. The NFS version 3 fattr3 structure contains a
fixed list of attributes that not all clients and servers are able to
support or care about. The fattr3 structure cannot be extended as
new needs arise, and it provides no way to indicate non-support.
With the NFS version 4 protocol, the client is able to query what
attributes
the server supports and construct requests with only those supported
attributes (or a subset thereof).
To this end, attributes are divided into three groups: mandatory,
recommended, and named. Both mandatory and recommended attributes
are supported in the NFS version 4 protocol by a specific and well-
defined encoding and are identified by number. They are requested by
setting a bit in the bit vector sent in the GETATTR request; the
server response includes a bit vector to list what attributes were
returned in the response. New mandatory or recommended attributes
may be added to the NFS protocol between major revisions by
publishing a standards-track RFC which allocates a new attribute
number value and defines the encoding for the attribute. See the
section "Minor Versioning" for further discussion.
Named attributes are accessed by the new OPENATTR operation, which
accesses a hidden directory of attributes associated with a file
system object. OPENATTR takes a filehandle for the object and
returns the filehandle for the attribute hierarchy. The filehandle
for the named attributes is a directory object accessible by LOOKUP
or READDIR and contains files whose names represent the named
attributes and whose data bytes are the value of the attribute. For
example:
+----------+-----------+---------------------------------+
| LOOKUP | "foo" | ; look up file |
| GETATTR | attrbits | |
| OPENATTR | | ; access foo's named attributes |
| LOOKUP | "x11icon" | ; look up specific attribute |
| READ | 0,4096 | ; read stream of bytes |
+----------+-----------+---------------------------------+
Named attributes are intended for data needed by applications rather
than by an NFS client implementation. NFS implementors are strongly
encouraged to define their new attributes as recommended attributes
by bringing them to the IETF standards-track process.
The set of attributes which are classified as mandatory is
deliberately small since servers must do whatever it takes to support
them. A server should support as many of the recommended attributes
as possible but by their definition, the server is not required to
support all of them. Attributes are deemed mandatory if the data is
both needed by a large number of clients and is not otherwise
reasonably computable by the client when support is not provided on
the server.
Note that the hidden directory returned by OPENATTR is a convenience
for protocol processing. The client should not make any assumptions
about the server's implementation of named attributes and whether the
underlying filesystem at the server has a named attribute directory
or not. Therefore, operations such as SETATTR and GETATTR on the
named attribute directory are undefined.
3.1. Mandatory Attributes
These MUST be supported by every NFS version 4 client and server in
order to ensure a minimum level of interoperability. The server must
store and return these attributes and the client must be able to
function with an attribute set limited to these attributes. With
just the mandatory attributes some client functionality may be
impaired or limited in some ways. A client may ask for any of these
attributes to be returned by setting a bit in the GETATTR request and
the server must return their value.
3.2. Recommended Attributes
These attributes are understood well enough to warrant support in the
NFS version 4 protocol. However, they may not be supported on all
clients and servers. A client may ask for any of these attributes to
be returned by setting a bit in the GETATTR request but must handle
the case where the server does not return them. A client may ask for
the set of attributes the server supports and should not request
attributes the server does not support. A server should be tolerant
of requests for unsupported attributes and simply not return them
rather than considering the request an error. It is expected that
servers will support all attributes they comfortably can and only
fail to support attributes which are difficult to support in their
operating environments. A server should provide attributes whenever
it does not have to "tell lies" to the client. For example, a file
modification time should be either an accurate time or should not be
supported by the server. This will not always be comfortable to
clients but the client is better positioned to decide whether and
how to fabricate or construct an attribute or whether to do without
the attribute.
3.3. Named Attributes
These attributes are not supported by direct encoding in the NFS
Version 4 protocol but are accessed by string names rather than
numbers and correspond to an uninterpreted stream of bytes which are
stored with the filesystem object. The name space for these
attributes may be accessed by using the OPENATTR operation. The
OPENATTR operation returns a filehandle for a virtual "attribute
directory" and further perusal of the name space may be done using
READDIR and LOOKUP operations on this filehandle. Named attributes
may then be examined or changed by normal READ, WRITE, and CREATE
operations on the filehandles returned from READDIR and LOOKUP.
Named attributes may have attributes.
It is recommended that servers support arbitrary named attributes. A
client should not depend on the ability to store any named attributes
in the server's filesystem. If a server does support named
attributes, a client which is also able to handle them should be able
to copy a file's data and meta-data with complete transparency from
one location to another; this would imply that names allowed for
regular directory entries are valid for named attribute names as
well.
Names of attributes will not be controlled by this document or other
IETF standards track documents. See the section "IANA
Considerations" for further discussion.
3.4. Classification of Attributes
Each of the Mandatory and Recommended attributes can be classified in
one of three categories: per server, per filesystem, or per
filesystem object. Note that it is possible that some per filesystem
attributes may vary within the filesystem. See the "homogeneous"
attribute for its definition. Note that the attributes
time_access_set and time_modify_set are not listed in this section
because they are write-only attributes corresponding to time_access
and time_modify, and are used in a special instance of SETATTR.
o The per server attribute is:
lease_time
o The per filesystem attributes are:
supp_attr, fh_expire_type, link_support, symlink_support,
unique_handles, aclsupport, cansettime, case_insensitive,
case_preserving, chown_restricted, files_avail, files_free,
files_total, fs_locations, homogeneous, maxfilesize, maxname,
maxread, maxwrite, no_trunc, space_avail, space_free,
space_total, time_delta, fs_layouttype, send_impl_id,
recv_impl_id
o The per filesystem object attributes are:
type, change, size, named_attr, fsid, rdattr_error, filehandle,
ACL, archive, fileid, hidden, maxlink, mimetype, mode,
numlinks, owner, owner_group, rawdev, space_used, system,
time_access, time_backup, time_create, time_metadata,
time_modify, mounted_on_fileid, layouttype, layouthint,
layout_blksize, layout_alignment
For quota_avail_hard, quota_avail_soft, and quota_used see their
definitions below for the appropriate classification.
3.5. Mandatory Attributes - Definitions
+-----------------+----+------------+--------+----------------------+
| name | # | Data Type | Access | Description |
+-----------------+----+------------+--------+----------------------+
| supp_attr | 0 | bitmap | READ | The bit vector which |
| | | | | would retrieve all |
| | | | | mandatory and |
| | | | | recommended |
| | | | | attributes that are |
| | | | | supported for this |
| | | | | object. The scope of |
| | | | | this attribute |
| | | | | applies to all |
| | | | | objects with a |
| | | | | matching fsid. |
| type | 1 | nfs4_ftype | READ | The type of the |
| | | | | object (file, |
| | | | | directory, symlink, |
| | | | | etc.) |
| fh_expire_type | 2 | uint32 | READ | Server uses this to |
| | | | | specify filehandle |
| | | | | expiration behavior |
| | | | | to the client. See |
| | | | | the section |
| | | | | "Filehandles" for |
| | | | | additional |
| | | | | description. |
| change | 3 | uint64 | READ | A value created by |
| | | | | the server that the |
| | | | | client can use to |
| | | | | determine if file |
| | | | | data, directory |
| | | | | contents or |
| | | | | attributes of the |
| | | | | object have been |
| | | | | modified. The server |
| | | | | may return the |
| | | | | object's |
| | | | | time_metadata |
| | | | | attribute for this |
| | | | | attribute's value |
| | | | | but only if the |
| | | | | filesystem object |
| | | | | can not be updated |
| | | | | more frequently than |
| | | | | the resolution of |
| | | | | time_metadata. |
| size | 4 | uint64 | R/W | The size of the |
| | | | | object in bytes. |
| link_support | 5 | bool | READ | True, if the |
| | | | | object's filesystem |
| | | | | supports hard links. |
| symlink_support | 6 | bool | READ | True, if the |
| | | | | object's filesystem |
| | | | | supports symbolic |
| | | | | links. |
| named_attr | 7 | bool | READ | True, if this object |
| | | | | has named |
| | | | | attributes. In other |
| | | | | words, the object has |
| | | | | a non-empty named |
| | | | | attribute directory. |
| fsid | 8 | fsid4 | READ | Unique filesystem |
| | | | | identifier for the |
| | | | | filesystem holding |
| | | | | this object. fsid |
| | | | | contains major and |
| | | | | minor components |
| | | | | each of which are |
| | | | | uint64. |
| unique_handles | 9 | bool | READ | True, if two |
| | | | | distinct filehandles |
| | | | | are guaranteed to |
| | | | | refer to two |
| | | | | different filesystem |
| | | | | objects. |
| lease_time | 10 | nfs_lease4 | READ | Duration of leases |
| | | | | at server in |
| | | | | seconds. |
| rdattr_error | 11 | enum | READ | Error returned from |
| | | | | getattr during |
| | | | | readdir. |
| filehandle | 19 | nfs_fh4 | READ | The filehandle of |
| | | | | this object |
| | | | | (primarily for |
| | | | | readdir requests). |
+-----------------+----+------------+--------+----------------------+
3.6. Recommended Attributes - Definitions
+--------------------+-----+--------------+--------+----------------+
| name | # | Data Type | Access | Description |
+--------------------+-----+--------------+--------+----------------+
| ACL | 12 | nfsace4<> | R/W | The access |
| | | | | control list |
| | | | | for the |
| | | | | object. |
| aclsupport | 13 | uint32 | READ | Indicates what |
| | | | | types of ACLs |
| | | | | are supported |
| | | | | on the current |
| | | | | filesystem. |
| archive | 14 | bool | R/W | True, if this |
| | | | | file has been |
| | | | | archived since |
| | | | | the time of |
| | | | | last |
| | | | | modification |
| | | | | (deprecated in |
| | | | | favor of |
| | | | | time_backup). |
| cansettime | 15 | bool | READ | True, if the |
| | | | | server is able |
| | | | | to change the |
| | | | | times for a |
| | | | | filesystem |
| | | | | object as |
| | | | | specified in a |
| | | | | SETATTR |
| | | | | operation. |
| case_insensitive | 16 | bool | READ | True, if |
| | | | | filename |
| | | | | comparisons on |
| | | | | this |
| | | | | filesystem are |
| | | | | case |
| | | | | insensitive. |
| case_preserving | 17 | bool | READ | True, if |
| | | | | filename case |
| | | | | on this |
| | | | | filesystem is |
| | | | | preserved. |
| chown_restricted | 18 | bool | READ | If TRUE, the |
| | | | | server will |
| | | | | reject any |
| | | | | request to |
| | | | | change either |
| | | | | the owner or |
| | | | | the group |
| | | | | associated |
| | | | | with a file if |
| | | | | the caller is |
| | | | | not a |
| | | | | privileged |
| | | | | user (for |
| | | | | example, |
| | | | | "root" in UNIX |
| | | | | operating |
| | | | | environments |
| | | | | or in Windows |
| | | | | 2000 the "Take |
| | | | | Ownership" |
| | | | | privilege). |
| fileid | 20 | uint64 | READ | A number |
| | | | | uniquely |
| | | | | identifying |
| | | | | the file |
| | | | | within the |
| | | | | filesystem. |
| files_avail | 21 | uint64 | READ | File slots |
| | | | | available to |
| | | | | this user on |
| | | | | the filesystem |
| | | | | containing |
| | | | | this object - |
| | | | | this should be |
| | | | | the smallest |
| | | | | relevant |
| | | | | limit. |
| files_free | 22 | uint64 | READ | Free file |
| | | | | slots on the |
| | | | | filesystem |
| | | | | containing |
| | | | | this object - |
| | | | | this should be |
| | | | | the smallest |
| | | | | relevant |
| | | | | limit. |
| files_total | 23 | uint64 | READ | Total file |
| | | | | slots on the |
| | | | | filesystem |
| | | | | containing |
| | | | | this object. |
| fs_locations | 24 | fs_locations | READ | Locations |
| | | | | where this |
| | | | | filesystem may |
| | | | | be found. If |
| | | | | the server |
| | | | | returns |
| | | | | NFS4ERR_MOVED |
| | | | | as an error, |
| | | | | this attribute |
| | | | | MUST be |
| | | | | supported. |
| hidden | 25 | bool | R/W | True, if the |
| | | | | file is |
| | | | | considered |
| | | | | hidden with |
| | | | | respect to the |
| | | | | Windows API. |
| homogeneous | 26 | bool | READ | True, if this |
| | | | | object's |
| | | | | filesystem is |
| | | | | homogeneous, |
| | | | | i.e., the per- |
| | | | | filesystem |
| | | | | attributes are |
| | | | | the same for |
| | | | | all of the |
| | | | | filesystem's |
| | | | | objects. |
| maxfilesize | 27 | uint64 | READ | Maximum |
| | | | | supported file |
| | | | | size for the |
| | | | | filesystem of |
| | | | | this object. |
| maxlink | 28 | uint32 | READ | Maximum number |
| | | | | of links for |
| | | | | this object. |
| maxname | 29 | uint32 | READ | Maximum |
| | | | | filename size |
| | | | | supported for |
| | | | | this object. |
| maxread | 30 | uint64 | READ | Maximum read |
| | | | | size supported |
| | | | | for this |
| | | | | object. |
| maxwrite | 31 | uint64 | READ | Maximum write |
| | | | | size supported |
| | | | | for this |
| | | | | object. This |
| | | | | attribute |
| | | | | SHOULD be |
| | | | | supported if |
| | | | | the file is |
| | | | | writable. Lack |
| | | | | of this |
| | | | | attribute can |
| | | | | lead to the |
| | | | | client either |
| | | | | wasting |
| | | | | bandwidth or |
| | | | | not receiving |
| | | | | the best |
| | | | | performance. |
| mimetype | 32 | utf8<> | R/W | MIME body |
| | | | | type/subtype |
| | | | | of this |
| | | | | object. |
| mode | 33 | mode4 | R/W | UNIX-style |
| | | | | mode and |
| | | | | permission |
| | | | | bits for this |
| | | | | object. |
| no_trunc | 34 | bool | READ | True, if using |
| | | | | a name longer |
| | | | | than name_max |
| | | | | causes an |
| | | | | error to be |
| | | | | returned and |
| | | | | the name is |
| | | | | not truncated. |
| numlinks | 35 | uint32 | READ | Number of hard |
| | | | | links to this |
| | | | | object. |
| owner | 36 | utf8<> | R/W | The string |
| | | | | name of the |
| | | | | owner of this |
| | | | | object. |
| owner_group | 37 | utf8<> | R/W | The string |
| | | | | name of the |
| | | | | group |
| | | | | ownership of |
| | | | | this object. |
| quota_avail_hard | 38 | uint64 | READ | For definition |
| | | | | see "Quota |
| | | | | Attributes" |
| | | | | section below. |
| quota_avail_soft | 39 | uint64 | READ | For definition |
| | | | | see "Quota |
| | | | | Attributes" |
| | | | | section below. |
| quota_used | 40 | uint64 | READ | For definition |
| | | | | see "Quota |
| | | | | Attributes" |
| | | | | section below. |
| rawdev | 41 | specdata4 | READ | Raw device |
| | | | | identifier. |
| | | | | UNIX device |
| | | | | major/minor |
| | | | | node |
| | | | | information. |
| | | | | If the value |
| | | | | of type is not |
| | | | | NF4BLK or |
| | | | | NF4CHR, the |
| | | | | value returned |
| | | | | SHOULD NOT be |
| | | | | considered |
| | | | | useful. |
| space_avail | 42 | uint64 | READ | Disk space in |
| | | | | bytes |
| | | | | available to |
| | | | | this user on |
| | | | | the filesystem |
| | | | | containing |
| | | | | this object - |
| | | | | this should be |
| | | | | the smallest |
| | | | | relevant |
| | | | | limit. |
| space_free | 43 | uint64 | READ | Free disk |
| | | | | space in bytes |
| | | | | on the |
| | | | | filesystem |
| | | | | containing |
| | | | | this object - |
| | | | | this should be |
| | | | | the smallest |
| | | | | relevant |
| | | | | limit. |
| space_total | 44 | uint64 | READ | Total disk |
| | | | | space in bytes |
| | | | | on the |
| | | | | filesystem |
| | | | | containing |
| | | | | this object. |
| space_used | 45 | uint64 | READ | Number of |
| | | | | filesystem |
| | | | | bytes |
| | | | | allocated to |
| | | | | this object. |
| system | 46 | bool | R/W | True, if this |
| | | | | file is a |
| | | | | "system" file |
| | | | | with respect |
| | | | | to the Windows |
| | | | | API. |
| time_access | 47 | nfstime4 | READ | The time of |
| | | | | last access to |
| | | | | the object by |
| | | | | a read that |
| | | | | was satisfied |
| | | | | by the server. |
| time_access_set | 48 | settime4 | WRITE | Set the time |
| | | | | of last access |
| | | | | to the object. |
| | | | | SETATTR use |
| | | | | only. |
| time_backup | 49 | nfstime4 | R/W | The time of |
| | | | | last backup of |
| | | | | the object. |
| time_create | 50 | nfstime4 | R/W | The time of |
| | | | | creation of |
| | | | | the object. |
| | | | | This attribute |
| | | | | does not have |
| | | | | any relation |
| | | | | to the |
| | | | | traditional |
| | | | | UNIX file |
| | | | | attribute |
| | | | | "ctime" or |
| | | | | "change time". |
| time_delta | 51 | nfstime4 | READ | Smallest |
| | | | | useful server |
| | | | | time |
| | | | | granularity. |
| time_metadata | 52 | nfstime4 | READ | The time of |
| | | | | last meta-data |
| | | | | modification |
| | | | | of the object. |
| time_modify | 53 | nfstime4 | READ | The time of |
| | | | | last |
| | | | | modification |
| | | | | to the object. |
| time_modify_set | 54 | settime4 | WRITE | Set the time |
| | | | | of last |
| | | | | modification |
| | | | | to the object. |
| | | | | SETATTR use |
| | | | | only. |
| mounted_on_fileid | 55 | uint64 | READ | Like fileid, |
| | | | | but if the |
| | | | | target |
| | | | | filehandle is |
| | | | | the root of a |
| | | | | filesystem |
| | | | | return the |
| | | | | fileid of the |
| | | | | underlying |
| | | | | directory. |
| send_impl_id | TBD | impl_ident4 | WRITE | Client |
| | | | | provides |
| | | | | server with |
| | | | | implementation |
| | | | | identity via |
| | | | | SETATTR. |
| recv_impl_id | TBD | nfs_impl_id4 | READ | Client obtains |
| | | | | server |
| | | | | implementation |
| | | | | via GETATTR. |
| dir_notif_delay | TBD | nfstime4 | READ | notification |
| | | | | delays on |
| | | | | directory |
| | | | | attributes |
| dirent_notif_delay | TBD | nfstime4 | READ | notification |
| | | | | delays on |
| | | | | child |
| | | | | attributes |
| fs_layouttype | TBD | layouttype4 | READ | Layout types |
| | | | | available for |
| | | | | the |
| | | | | filesystem. |
| layouttype | TBD | layouttype4 | READ | Layout types |
| | | | | available for |
| | | | | the file. |
| layouthint | TBD | layouthint4 | WRITE | Client |
| | | | | specified hint |
| | | | | for file |
| | | | | layout. |
| layout_blksize | TBD | uint32_t | READ | Preferred |
| | | | | block size for |
| | | | | layout related |
| | | | | I/O. |
| layout_alignment | TBD | uint32_t | READ | Preferred |
| | | | | alignment for |
| | | | | layout related |
| | | | | I/O. |
| | TBD | | READ | desc |
| | TBD | | READ | desc |
+--------------------+-----+--------------+--------+----------------+
3.7. Time Access
As defined above, the time_access attribute represents the time of
last access to the object by a read that was satisfied by the server.
The notion of what is an "access" depends on the server's operating
environment and/or the server's filesystem semantics. For example,
for servers obeying POSIX semantics, time_access would be updated
only by the READLINK, READ, and READDIR operations and not any of the
operations that modify the content of the object. Of course, setting
the corresponding time_access_set attribute is another way to modify
the time_access attribute.
Whenever the file object resides on a writable filesystem, the server
should make best efforts to record time_access into stable storage.
However, to mitigate the performance effects of doing so, and most
especially whenever the server is satisfying the read of the object's
content from its cache, the server MAY cache access time updates and
lazily write them to stable storage. It is also acceptable to give
administrators of the server the option to disable time_access
updates.
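The lazy time_access handling described above can be sketched as
follows. This is a minimal illustration, assuming a hypothetical
write_atime(fileid, atime) stable-storage callback and an arbitrary
flush interval; neither is part of the protocol.

```python
import time

class AccessTimeCache:
    """Record time_access updates in memory and write them to stable
    storage at most once per flush_interval seconds per file."""

    def __init__(self, write_atime, flush_interval=30.0):
        self.write_atime = write_atime      # hypothetical storage callback
        self.flush_interval = flush_interval
        self.pending = {}                   # fileid -> newest access time
        self.last_flush = {}                # fileid -> last stable write

    def record_access(self, fileid, now=None):
        now = time.time() if now is None else now
        self.pending[fileid] = now
        # Only touch stable storage if enough time has passed; reads
        # satisfied from cache in between cost nothing extra.
        if now - self.last_flush.get(fileid, 0.0) >= self.flush_interval:
            self.flush(fileid, now)

    def flush(self, fileid, now):
        self.write_atime(fileid, self.pending.pop(fileid))
        self.last_flush[fileid] = now
```

A real server would also flush pending updates on unmount or
shutdown, and could expose a knob to disable updates entirely, as
the text permits.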
3.8. Interpreting owner and owner_group
The recommended attributes "owner" and "owner_group" (and also users
and groups within the "acl" attribute) are represented in terms of a
UTF-8 string. To avoid a representation that is tied to a particular
underlying implementation at the client or server, the use of the
UTF-8 string has been chosen. Note that section 6.1 of [RFC2624]
provides additional rationale. It is expected that the client and
server will have their own local representation of owner and
owner_group that is used for local storage or presentation to the end
user. Therefore, it is expected that when these attributes are
transferred between the client and server that the local
representation is translated to a syntax of the form "user@
dns_domain". This will allow for a client and server that do not use
the same local representation the ability to translate to a common
syntax that can be interpreted by both.
Similarly, security principals may be represented in different ways
by different security mechanisms. Servers normally translate these
representations into a common format, generally that used by local
storage, to serve as a means of identifying the users corresponding
to these security principals. When these local identifiers are
translated into the form of the owner attribute and associated with
files created by such principals, they identify, in a common format,
the users associated with each corresponding set of security
principals.
The translation used to interpret owner and group strings is not
specified as part of the protocol. This allows various solutions to
be employed. For example, a local translation table may be consulted
that maps between a numeric id to the user@dns_domain syntax. A name
service may also be used to accomplish the translation. A server may
provide a more general service, not limited by any particular
translation (which would only translate a limited set of possible
strings) by storing the owner and owner_group attributes in local
storage without any translation or it may augment a translation
method by storing the entire string for attributes for which no
translation is available while using the local representation for
those cases in which a translation is available.
Servers that do not provide support for all possible values of the
owner and owner_group attributes, should return an error
(NFS4ERR_BADOWNER) when a string is presented that has no
translation, as the value to be set for a SETATTR of the owner,
owner_group, or acl attributes. When a server does accept an owner
or owner_group value as valid on a SETATTR (and similarly for the
owner and group strings in an acl), it is promising to return that
same string when a corresponding GETATTR is done. Configuration
changes and ill-constructed name translations (those that contain
aliasing) may make that promise impossible to honor. Servers should
make appropriate efforts to avoid a situation in which these
attributes have their values changed when no real change to ownership
has occurred.
The "dns_domain" portion of the owner string is meant to be a DNS
domain name. For example, user@ietf.org. Servers should accept as
valid a set of users for at least one domain. A server may treat
other domains as having no valid translations. A more general
service is provided when a server is capable of accepting users for
multiple domains, or for all domains, subject to security
constraints.
In the case where there is no translation available to the client or
server, the attribute value must be constructed without the "@".
Therefore, the absence of the @ from the owner or owner_group
attribute signifies that no translation was available at the sender
and that the receiver of the attribute should not use that string as
a basis for translation into its own internal format. Even though
the attribute value can not be translated, it may still be useful.
In the case of a client, the attribute string may be used for local
display of ownership.
To provide a greater degree of compatibility with previous versions
of NFS (i.e. v2 and v3), which identified users and groups by 32-bit
unsigned uid's and gid's, owner and group strings that consist of
decimal numeric values with no leading zeros can be given a special
interpretation by clients and servers which choose to provide such
support. The receiver may treat such a user or group string as
representing the same user as would be represented by a v2/v3 uid or
gid having the corresponding numeric value. A server is not
obligated to accept such a string, but may return an NFS4ERR_BADOWNER
instead. To avoid this mechanism being used to subvert user and
group translation, so that a client might pass all of the owners and
groups in numeric form, a server SHOULD return an NFS4ERR_BADOWNER
error when there is a valid translation for the user or owner
designated in this way. In that case, the client must use the
appropriate name@domain string and not the special form for
compatibility.
The owner string "nobody" may be used to designate an anonymous user,
which will be associated with a file created by a security principal
that cannot be mapped through normal means to the owner attribute.
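The receiver-side rules above can be sketched as follows. The
function and the classification names ("name", "numeric", "opaque")
are illustrative only, not protocol elements.

```python
def interpret_owner(owner: str):
    """Classify an owner/owner_group string per the rules above."""
    if "@" in owner:
        # user@dns_domain form: translatable into a local identity.
        user, domain = owner.split("@", 1)
        return ("name", (user, domain))
    # Decimal digits with no leading zeros MAY be treated as a legacy
    # v2/v3 numeric id by receivers that choose to support this; a
    # server may instead reject it with NFS4ERR_BADOWNER.
    if owner.isdigit() and (owner == "0" or not owner.startswith("0")):
        return ("numeric", int(owner))
    # No "@": no translation was available at the sender.  The string
    # (e.g. "nobody") should not be translated into a local id, though
    # a client may still use it for display.
    return ("opaque", owner)
```

Note that a sender with no available translation constructs the
value without the "@", which is exactly what the last branch detects.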
3.9. Character Case Attributes
With respect to the case_insensitive and case_preserving attributes,
each UCS-4 character (which UTF-8 encodes) has a "long descriptive
name" [RFC1345] which may or may not include the word "CAPITAL" or
"SMALL". The presence of SMALL or CAPITAL allows an NFS server to
implement unambiguous and efficient table driven mappings for case
insensitive comparisons, and non-case-preserving storage. For
general character handling and internationalization issues, see the
section "Internationalization".
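As an illustration of such a table-driven mapping, the sketch below
builds a CAPITAL-to-small folding table once and compares names
through it. For brevity only the ASCII range is covered; a real
server would populate the table from the full character repertoire
it supports.

```python
# One-time folding table: CAPITAL letters map to their small forms.
FOLD = {chr(c): chr(c + 0x20) for c in range(ord('A'), ord('Z') + 1)}

def fold(name: str) -> str:
    # Characters without a CAPITAL/SMALL pairing fold to themselves.
    return "".join(FOLD.get(ch, ch) for ch in name)

def names_equal_case_insensitive(a: str, b: str) -> bool:
    # How a case_insensitive filesystem would compare two filenames.
    return fold(a) == fold(b)
```

A non-case-preserving server could likewise store fold(name) as the
canonical on-disk form.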
3.10. Quota Attributes
For the attributes related to filesystem quotas, the following
definitions apply:
quota_avail_soft The value in bytes which represents the amount of
additional disk space that can be allocated to this file or
directory before the user may reasonably be warned. It is
understood that this space may be consumed by allocations to other
files or directories though there is a rule as to which other
files or directories.
quota_avail_hard The value in bytes which represents the amount of
additional disk space beyond the current allocation that can be
allocated to this file or directory before further allocations
will be refused. It is understood that this space may be consumed
by allocations to other files or directories.
quota_used The value in bytes which represents the amount of disk
space used by this file or directory and possibly a number of
other similar files or directories, where the set of "similar"
meets at least the criterion that allocating space to any file or
directory in the set will reduce the "quota_avail_hard" of every
other file or directory in the set.
Note that there may be a number of distinct but overlapping sets
of files or directories for which a quota_used value is
maintained. E.g., "all files with a given owner", "all files with
a given group owner", etc.
The server is at liberty to choose any of those sets but should do
so in a repeatable way. The rule may be configured per-filesystem
or may be "choose the set with the smallest quota".
3.11. mounted_on_fileid
UNIX-based operating environments connect a filesystem into the
namespace by connecting (mounting) the filesystem onto the existing
file object (the mount point, usually a directory) of an existing
filesystem. When the mount point's parent directory is read via an
API like readdir(), the return results are directory entries, each
with a component name and a fileid. The fileid of the mount point's
directory entry will be different from the fileid that the stat()
system call returns. The stat() system call is returning the fileid
of the root of the mounted filesystem, whereas readdir() is returning
the fileid stat() would have returned before any filesystems were
mounted on the mount point.
Unlike NFS version 3, NFS version 4 allows a client's LOOKUP request
to cross other filesystems. The client detects the filesystem
crossing whenever the filehandle argument of LOOKUP has an fsid
attribute different from that of the filehandle returned by LOOKUP.
A UNIX-based client will consider this a "mount point crossing".
UNIX has a legacy scheme for allowing a process to determine its
current working directory. This relies on readdir() of a mount
point's parent and stat() of the mount point returning fileids as
previously described. The mounted_on_fileid attribute corresponds to
the fileid that readdir() would have returned as described
previously.
While the NFS version 4 client could simply fabricate a fileid
corresponding to what mounted_on_fileid provides (and if the server
does not support mounted_on_fileid, the client has no choice), there
is a risk that the client will generate a fileid that conflicts with
one that is already assigned to another object in the filesystem.
Instead, if the server can provide the mounted_on_fileid, the
potential for client operational problems in this area is eliminated.
If the server detects that there is no filesystem mounted at the
target file object, then the value for mounted_on_fileid that it
returns is the same as that of the fileid attribute.
The mounted_on_fileid attribute is RECOMMENDED, so the server SHOULD
provide it if possible, and for a UNIX-based server, this is
straightforward. Usually, mounted_on_fileid will be requested during
a READDIR operation, in which case it is trivial (at least for UNIX-
based servers) to return mounted_on_fileid since it is equal to the
fileid of a directory entry returned by readdir(). If
mounted_on_fileid is requested in a GETATTR operation, the server
should obey an invariant that has it returning a value that is equal
to the file object's entry in the object's parent directory, i.e.
what readdir() would have returned. Some operating environments
allow a series of two or more filesystems to be mounted onto a single
mount point. In this case, for the server to obey the aforementioned
invariant, it will need to find the base mount point, and not the
intermediate mount points.
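The invariant above can be sketched with a small in-memory model in
which a filesystem root records the directory entry it covers; the
model and its field names are illustrative only. Note the loop,
which walks a stack of mounts down to the base mount point.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class FsObject:
    fileid: int
    is_fs_root: bool = False
    # For a filesystem root, the directory entry this mount covers
    # (possibly itself the root of another mounted filesystem).
    mounted_on: Optional["FsObject"] = None

def mounted_on_fileid(obj: FsObject) -> int:
    # Walk down any stack of mounts to the base mount point; for an
    # ordinary object this simply returns its own fileid, matching
    # the behavior when there is no mount at the target object.
    while obj.is_fs_root and obj.mounted_on is not None:
        obj = obj.mounted_on
    return obj.fileid
```

This returns what readdir() of the parent directory would have
reported, which is exactly the invariant the text requires of
GETATTR.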
3.12. send_impl_id and recv_impl_id
These recommended attributes are used to identify the client and
server. In the case of the send_impl_id attribute, the client sends
its clientid4 value along with the nfs_impl_id4. The use of the
clientid4 value allows the server to identify and match specific
client interaction. In the case of the recv_impl_id attribute, the
client receives the nfs_impl_id4 value.
Access to this identification information can be most useful at both
client and server. Being able to identify specific implementations
can help in planning by administrators or implementers. For example,
diagnostic software may extract this information in an attempt to
identify implementation problems, performance workload behaviors or
general usage statistics. Since the intent of having access to this
information is for planning or general diagnosis only, the client and
server MUST NOT interpret this implementation identity information in
a way that affects interoperational behavior of the implementation.
The reason is that if clients and servers did so, they might use
fewer capabilities of the protocol than the peer can support, or the
client and server might refuse to interoperate.
Because it is likely some implementations will violate the protocol
specification and interpret the identity information, implementations
MUST allow the users of the NFSv4 client and server to set the
contents of the sent nfs_impl_id structure to any value.
Even though these attributes are recommended, if the server supports
one of them it MUST support the other.
3.13. fs_layouttype
This attribute applies to a file system and indicates what layout
types are supported by the file system. We expect this attribute to
be queried when a client encounters a new fsid. This attribute is
used by the client to determine if it has applicable layout drivers.
3.14. layouttype
This attribute indicates the particular layout type(s) used for a
file. This is for informational purposes only. The client needs to
use the LAYOUTGET operation in order to get enough information (e.g.,
specific device information) in order to perform I/O.
3.15. layouthint
This attribute may be set on newly created files to influence the
metadata server's choice for the file's layout. It is suggested that
this attribute is set as one of the initial attributes within the
OPEN call. The metadata server may ignore this attribute. This
attribute is a subset of the layout structure returned by LAYOUTGET.
For example, instead of specifying particular devices, this would be
used to suggest the stripe width of a file. It is up to the server
implementation to determine which fields within the layout it uses.
[[Comment.3: it has been suggested that the HINT is a well defined
type other than pnfs_layoutdata4, similar to pnfs_layoutupdate4.]]
3.16. Access Control Lists
The NFS version 4 ACL attribute is an array of access control entries
(ACE). Although the client can read and write the ACL attribute, the
NFSv4 model is that the server does all access control based on the
server's interpretation of the ACL. If at any point the client wants
to check access without issuing an operation that modifies or reads
data or metadata, the client can use the OPEN and ACCESS operations
to do so. There are various access control entry types, as defined
in Section 3.16.1. The server is able to communicate which ACE types
are supported by returning the appropriate value within the
aclsupport attribute. Each ACE covers one or more operations on a
file or directory as described in Section 3.16.2. It may also
contain one or more flags that modify the semantics of the ACE as
defined in Section 3.16.3.
The NFS ACE attribute is defined as follows:
typedef uint32_t acetype4;
typedef uint32_t aceflag4;
typedef uint32_t acemask4;
struct nfsace4 {
acetype4 type;
aceflag4 flag;
acemask4 access_mask;
utf8str_mixed who;
};
To determine if a request succeeds, each nfsace4 entry is processed
in order by the server. Only ACEs which have a "who" that matches
the requester are considered. Each ACE is processed until all of the
bits of the requester's access have been ALLOWED. Once a bit (see
below) has been ALLOWED by an ACCESS_ALLOWED_ACE, it is no longer
considered in the processing of later ACEs. If an ACCESS_DENIED_ACE
is encountered where the requester's access still has unALLOWED bits
in common with the "access_mask" of the ACE, the request is denied.
However, unlike the ALLOWED and DENIED ACE types, the ALARM and AUDIT
ACE types do not affect a requester's access, and instead are for
triggering events as a result of a requester's access attempt.
Therefore, all AUDIT and ALARM ACEs are processed until the end of
the ACL. When the ACL is fully processed, if there are bits in the
requester's mask that have been neither ALLOWED nor DENIED, access is
denied. Independent of this ACL processing, servers may have other
restrictions or implementation-defined security policies in place; in
those cases, access may be decided outside of what is in the ACL.
Examples of such security policies or restrictions are:
o The owner of the file will always be granted ACE4_WRITE_ACL
and ACE4_READ_ACL permissions. This prevents the owner from
getting into the situation where they can never modify the ACL.
o The ACL may say that an entity is to be granted ACE4_WRITE_DATA
permission, but the file system is mounted read only, therefore
write access is denied.
As mentioned before, this is one of the reasons that client
implementations are discouraged from doing their own access checking.
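As a non-normative illustration, the ACE evaluation loop described
above can be sketched as follows. The helper names `evaluate_access`
and `who_matches`, and the simplified ACE representation, are
assumptions of this sketch, not part of the protocol; AUDIT/ALARM
processing is omitted.

```python
# Non-normative sketch of the ACL evaluation described above.
# An ACE is modeled as a (type, who, access_mask) tuple.

ALLOW, DENY = 0x00000000, 0x00000001  # acetype4 values for ALLOW/DENY

def evaluate_access(acl, who_matches, requested_mask):
    """Return True iff every requested bit is ALLOWED before being DENIED.

    acl            -- list of (type, who, access_mask) tuples, in order
    who_matches    -- predicate deciding if an ACE's "who" matches the
                      requester; non-matching ACEs are skipped
    requested_mask -- bitmask of access bits the requester wants
    """
    remaining = requested_mask          # bits not yet ALLOWED
    for ace_type, who, mask in acl:
        if not who_matches(who):
            continue                    # only matching ACEs are considered
        if ace_type == ALLOW:
            remaining &= ~mask          # ALLOWED bits are settled for good
            if remaining == 0:
                return True
        elif ace_type == DENY:
            if remaining & mask:        # an unALLOWED bit is denied
                return False
    return False                        # undecided bits remain => deny
```

Note that an ALLOW ACE appearing after a DENY for the same bit has no
effect, and bits never mentioned by any matching ACE cause denial.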
The NFS version 4 ACL model is quite rich. Some server platforms may
provide access control functionality that goes beyond the UNIX-style
mode attribute, but which is not as rich as the NFS ACL model. So
that users can take advantage of this more limited functionality, the
server may indicate that it supports ACLs as long as it follows the
guidelines for mapping between its ACL model and the NFS version 4
ACL model.
The situation is complicated by the fact that a server may have
multiple modules that enforce ACLs. For example, the enforcement for
NFS version 4 access may be different from the enforcement for local
access, and both may be different from the enforcement for access
through other protocols such as SMB. So it may be useful for a
server to accept an ACL even if not all of its modules are able to
support it.
The guiding principle in all cases is that the server must not accept
ACLs that appear to make the file more secure than it really is.
3.16.1. ACE type
Type Description
_____________________________________________________
ALLOW Explicitly grants the access defined in
acemask4 to the file or directory.
DENY Explicitly denies the access defined in
acemask4 to the file or directory.
AUDIT LOG (system dependent) any access
attempt to a file or directory which
uses any of the access methods specified
in acemask4.
ALARM Generate a system ALARM (system
dependent) when any access attempt is
made to a file or directory for the
access methods specified in acemask4.
A server need not support all of the above ACE types. The bitmask
constants used to represent the above definitions within the
aclsupport attribute are as follows:
const ACL4_SUPPORT_ALLOW_ACL = 0x00000001;
const ACL4_SUPPORT_DENY_ACL = 0x00000002;
const ACL4_SUPPORT_AUDIT_ACL = 0x00000004;
const ACL4_SUPPORT_ALARM_ACL = 0x00000008;
The semantics of the "type" field follow the descriptions provided
above.
The constants used for the type field (acetype4) are as follows:
const ACE4_ACCESS_ALLOWED_ACE_TYPE = 0x00000000;
const ACE4_ACCESS_DENIED_ACE_TYPE = 0x00000001;
const ACE4_SYSTEM_AUDIT_ACE_TYPE = 0x00000002;
const ACE4_SYSTEM_ALARM_ACE_TYPE = 0x00000003;
Clients should not attempt to set an ACE unless the server claims
support for that ACE type. If the server receives a request to set
an ACE that it cannot store, it MUST reject the request with
NFS4ERR_ATTRNOTSUPP. If the server receives a request to set an ACE
that it can store but cannot enforce, the server SHOULD reject the
request with NFS4ERR_ATTRNOTSUPP.
Example: suppose a server can enforce NFS ACLs for NFS access but
cannot enforce ACLs for local access. If arbitrary processes can run
on the server, then the server SHOULD NOT indicate ACL support. On
the other hand, if only trusted administrative programs run locally,
then the server may indicate ACL support.
3.16.2. ACE Access Mask
The access_mask field contains values based on the following:
ACE4_READ_DATA
Operation(s) affected:
READ
OPEN
Discussion:
Permission to read the data of the file.
ACE4_LIST_DIRECTORY
Operation(s) affected:
READDIR
Discussion:
Permission to list the contents of a directory.
ACE4_WRITE_DATA
Operation(s) affected:
WRITE
OPEN
Discussion:
Permission to modify a file's data anywhere in the file's
offset range. This includes the ability to write to any
arbitrary offset and as a result to grow the file.
ACE4_ADD_FILE
Operation(s) affected:
CREATE
OPEN
Discussion:
Permission to add a new file in a directory. The CREATE
operation is affected when nfs_ftype4 is NF4LNK, NF4BLK,
NF4CHR, NF4SOCK, or NF4FIFO. (NF4DIR is not listed because
it is covered by ACE4_ADD_SUBDIRECTORY.) OPEN is affected
when used to create a regular file.
ACE4_APPEND_DATA
Operation(s) affected:
WRITE
OPEN
Discussion:
The ability to modify a file's data, but only starting at
EOF. This allows for the notion of append-only files, by
allowing ACE4_APPEND_DATA and denying ACE4_WRITE_DATA to
the same user or group. If a file has an ACL such as the
one described above and a WRITE request is made for
somewhere other than EOF, the server SHOULD return
NFS4ERR_ACCESS.
ACE4_ADD_SUBDIRECTORY
Operation(s) affected:
CREATE
Discussion:
Permission to create a subdirectory in a directory. The
CREATE operation is affected when nfs_ftype4 is NF4DIR.
ACE4_READ_NAMED_ATTRS
Operation(s) affected:
OPENATTR
Discussion:
Permission to read the named attributes of a file or to
lookup the named attributes directory. OPENATTR is
affected when it is not used to create a named attribute
directory. This is when 1.) createdir is TRUE, but a
named attribute directory already exists, or 2.) createdir
is FALSE.
ACE4_WRITE_NAMED_ATTRS
Operation(s) affected:
OPENATTR
Discussion:
Permission to write the named attributes of a file or
to create a named attribute directory. OPENATTR is
affected when it is used to create a named attribute
directory. This is when createdir is TRUE and no named
attribute directory exists. The ability to check whether
or not a named attribute directory exists depends on the
ability to look it up, therefore, users also need the
ACE4_READ_NAMED_ATTRS permission in order to create a
named attribute directory.
ACE4_EXECUTE
Operation(s) affected:
LOOKUP
Discussion:
Permission to execute a file or traverse/search a
directory.
ACE4_DELETE_CHILD
Operation(s) affected:
REMOVE
Discussion:
Permission to delete a file or directory within a
directory. See section "ACE4_DELETE vs.
ACE4_DELETE_CHILD" for information on how these two access
mask bits interact.
ACE4_READ_ATTRIBUTES
Operation(s) affected:
GETATTR of file system object attributes
Discussion:
The ability to read basic attributes (non-ACLs) of a file.
On a UNIX system, basic attributes can be thought of as
the stat level attributes. Allowing this access mask bit
would mean the entity can execute "ls -l" and stat.
ACE4_WRITE_ATTRIBUTES
Operation(s) affected:
SETATTR of time_access_set, time_backup,
time_create, time_modify_set
Discussion:
Permission to change the times associated with a file
or directory to an arbitrary value. A user having
ACE4_WRITE_DATA permission, but lacking ACE4_WRITE_ATTRIBUTES, must
be allowed to implicitly set the times associated with a file.
ACE4_DELETE
Operation(s) affected:
REMOVE
Discussion:
Permission to delete the file or directory. See section
"ACE4_DELETE vs. ACE4_DELETE_CHILD" for information on how
these two access mask bits interact.
ACE4_READ_ACL
Operation(s) affected:
GETATTR of acl
Discussion:
Permission to read the ACL.
ACE4_WRITE_ACL
Operation(s) affected:
SETATTR of acl and mode
Discussion:
Permission to write the acl and mode attributes.
ACE4_WRITE_OWNER
Operation(s) affected:
SETATTR of owner and owner_group
Discussion:
Permission to write the owner and owner_group attributes.
On UNIX systems, this is the ability to execute chown or
chgrp.
ACE4_SYNCHRONIZE
Operation(s) affected:
NONE
Discussion:
Permission to access file locally at the server with
synchronized reads and writes.
The bitmask constants used for the access mask field are as follows:
const ACE4_READ_DATA = 0x00000001;
const ACE4_LIST_DIRECTORY = 0x00000001;
const ACE4_WRITE_DATA = 0x00000002;
const ACE4_ADD_FILE = 0x00000002;
const ACE4_APPEND_DATA = 0x00000004;
const ACE4_ADD_SUBDIRECTORY = 0x00000004;
const ACE4_READ_NAMED_ATTRS = 0x00000008;
const ACE4_WRITE_NAMED_ATTRS = 0x00000010;
const ACE4_EXECUTE = 0x00000020;
const ACE4_DELETE_CHILD = 0x00000040;
const ACE4_READ_ATTRIBUTES = 0x00000080;
const ACE4_WRITE_ATTRIBUTES = 0x00000100;
const ACE4_DELETE = 0x00010000;
const ACE4_READ_ACL = 0x00020000;
const ACE4_WRITE_ACL = 0x00040000;
const ACE4_WRITE_OWNER = 0x00080000;
const ACE4_SYNCHRONIZE = 0x00100000;
Server implementations need not provide the granularity of control
that is implied by this list of masks. For example, POSIX-based
systems might not distinguish APPEND_DATA (the ability to append to a
file) from WRITE_DATA (the ability to modify existing contents); both
masks would be tied to a single "write" permission. When such a
server returns attributes to the client, it would show both
APPEND_DATA and WRITE_DATA if and only if the write permission is
enabled.
If a server receives a SETATTR request that it cannot accurately
implement, it should error in the direction of more restricted
access. For example, suppose a server cannot distinguish overwriting
data from appending new data, as described in the previous paragraph.
If a client submits an ACE where APPEND_DATA is set but WRITE_DATA is
not (or vice versa), the server should reject the request with
NFS4ERR_ATTRNOTSUPP. Nonetheless, if the ACE has type DENY, the
server may silently turn on the other bit, so that both APPEND_DATA
and WRITE_DATA are denied.
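As a non-normative sketch of the behavior in the last two paragraphs,
consider a server with a single POSIX-style "write" permission. The
helper names below are illustrative assumptions, not part of the
protocol.

```python
# Non-normative sketch.  The constants match the acemask4 definitions
# in this document; ALLOW/DENY encode the acetype4 values.
ACE4_WRITE_DATA  = 0x00000002
ACE4_APPEND_DATA = 0x00000004
ALLOW, DENY = 0x00000000, 0x00000001

def mask_returned_to_client(posix_write_enabled):
    """Report both bits if and only if the single "write" bit is set."""
    if posix_write_enabled:
        return ACE4_WRITE_DATA | ACE4_APPEND_DATA
    return 0

def accept_setattr_mask(ace_type, mask):
    """Accept only masks this server can represent exactly.

    An ALLOW ACE carrying just one of WRITE_DATA/APPEND_DATA cannot be
    enforced and is rejected (NFS4ERR_ATTRNOTSUPP); a DENY ACE may be
    widened to deny both bits, erring toward more restricted access.
    """
    both = ACE4_WRITE_DATA | ACE4_APPEND_DATA
    if (mask & both) in (0, both):
        return mask                    # representable as-is
    if ace_type == DENY:
        return mask | both             # silently deny both bits
    raise ValueError("NFS4ERR_ATTRNOTSUPP")
```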
3.16.2.1. ACE4_DELETE vs. ACE4_DELETE_CHILD
There are two separate access mask bits that govern the ability to
delete a file: ACE4_DELETE and ACE4_DELETE_CHILD. ACE4_DELETE is
intended to be specified by the ACL for the object to be deleted, and
ACE4_DELETE_CHILD is intended to be specified by the ACL of the
parent directory.
In addition to ACE4_DELETE and ACE4_DELETE_CHILD, many systems also
consider the "sticky bit" (MODE4_SVTX) and the appropriate "write"
mode bit when determining whether to allow a file to be deleted. The
mode bit for write corresponds to ACE4_WRITE_DATA, which is the same
physical bit as ACE4_ADD_FILE. Therefore, ACE4_ADD_FILE can come
into play when determining permission to delete.
In the algorithm below, the strategy is that ACE4_DELETE and
ACE4_DELETE_CHILD take precedence over the sticky bit, and the sticky
bit takes precedence over the "write" mode bits (reflected in
ACE4_ADD_FILE).
Server implementations SHOULD grant or deny permission to delete
based on the following algorithm.
if ACE4_EXECUTE is denied by the parent directory ACL:
deny delete
else if ACE4_EXECUTE is unspecified by the parent
directory ACL:
deny delete
else if ACE4_DELETE is allowed by the target object ACL:
allow delete
else if ACE4_DELETE_CHILD is allowed by the parent
directory ACL:
allow delete
else if ACE4_DELETE_CHILD is denied by the
parent directory ACL:
deny delete
else if ACE4_ADD_FILE is allowed by the parent directory ACL:
if MODE4_SVTX is set for the parent directory:
if the principal owns the parent directory OR
the principal owns the target object OR
ACE4_WRITE_DATA is allowed by the target
object ACL:
allow delete
else:
deny delete
else:
allow delete
else:
deny delete
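The algorithm above can be transcribed non-normatively as follows.
The tri-state arguments, where None stands for "unspecified by the
relevant ACL", and the function name are assumptions of the sketch;
full ACL evaluation is out of scope here.

```python
# Non-normative transcription of the delete-permission algorithm.
# Each ACL-derived argument is True (allowed), False (denied), or
# None (unspecified by the relevant ACL).

def may_delete(parent_execute, target_delete, parent_delete_child,
               parent_add_file, svtx_set, owns_parent, owns_target,
               target_write_data):
    if parent_execute is not True:       # denied or unspecified
        return False
    if target_delete is True:            # ACE4_DELETE on the target
        return True
    if parent_delete_child is True:      # ACE4_DELETE_CHILD on the parent
        return True
    if parent_delete_child is False:     # explicitly denied
        return False
    if parent_add_file is True:          # fall back to "write" on parent
        if svtx_set:                     # MODE4_SVTX: sticky directory
            return bool(owns_parent or owns_target or
                        target_write_data is True)
        return True
    return False
```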
3.16.3. ACE flag
The "flag" field contains values based on the following descriptions.
ACE4_FILE_INHERIT_ACE
Can be placed on a directory and indicates that this ACE should be
added to each new non-directory file created.
ACE4_DIRECTORY_INHERIT_ACE
Can be placed on a directory and indicates that this ACE should be
added to each new directory created.
ACE4_INHERIT_ONLY_ACE
Can be placed on a directory but does not apply to the directory,
only to newly created files/directories as specified by the above
two flags.
ACE4_NO_PROPAGATE_INHERIT_ACE
Can be placed on a directory. Normally when a new directory is
created and an ACE exists on the parent directory which is marked
ACE4_DIRECTORY_INHERIT_ACE, two ACEs are placed on the new
directory. One for the directory itself and one which is an
inheritable ACE for newly created directories. This flag tells
the server to not place an ACE on the newly created directory
which is inheritable by subdirectories of the created directory.
ACE4_SUCCESSFUL_ACCESS_ACE_FLAG
ACE4_FAILED_ACCESS_ACE_FLAG
The ACE4_SUCCESSFUL_ACCESS_ACE_FLAG (SUCCESS) and
ACE4_FAILED_ACCESS_ACE_FLAG (FAILED) flag bits relate only to
ACE4_SYSTEM_AUDIT_ACE_TYPE (AUDIT) and ACE4_SYSTEM_ALARM_ACE_TYPE
(ALARM) ACE types. If during the processing of the file's ACL,
the server encounters an AUDIT or ALARM ACE that matches the
principal attempting the OPEN, the server notes that fact, and the
presence, if any, of the SUCCESS and FAILED flags encountered in
the AUDIT or ALARM ACE. Once the server completes the ACL
processing, and the share reservation processing, and the OPEN
call, it then notes if the OPEN succeeded or failed. If the OPEN
succeeded, and if the SUCCESS flag was set for a matching AUDIT or
ALARM, then the appropriate AUDIT or ALARM event occurs. If the
OPEN failed, and if the FAILED flag was set for the matching AUDIT
or ALARM, then the appropriate AUDIT or ALARM event occurs.
Either or both of the SUCCESS and FAILED flags can be set, but if
neither is set, the AUDIT or ALARM ACE is not useful.
The previously described processing applies to the ACCESS operation
as well. The difference is that "success" or "failure" does not mean
whether ACCESS returns NFS4_OK or not. Success means that ACCESS
returns all requested and supported bits. Failure means that ACCESS
fails to return a bit that was requested and supported.
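The SUCCESS/FAILED decision for a matching AUDIT or ALARM ACE can be
sketched non-normatively as follows (the helper name
`should_trigger` is illustrative, not part of the protocol):

```python
# Non-normative sketch: whether a matching AUDIT/ALARM ACE triggers
# its event, given the outcome of the OPEN (or, for ACCESS, whether
# all requested and supported bits were returned).
ACE4_SUCCESSFUL_ACCESS_ACE_FLAG = 0x00000010
ACE4_FAILED_ACCESS_ACE_FLAG     = 0x00000020

def should_trigger(ace_flags, operation_succeeded):
    """Fire on success only with SUCCESS set; on failure only with FAILED."""
    if operation_succeeded:
        return bool(ace_flags & ACE4_SUCCESSFUL_ACCESS_ACE_FLAG)
    return bool(ace_flags & ACE4_FAILED_ACCESS_ACE_FLAG)
```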
ACE4_IDENTIFIER_GROUP
Indicates that the "who" refers to a GROUP as defined under UNIX
or a GROUP ACCOUNT as defined under Windows. Clients and servers
may ignore the ACE4_IDENTIFIER_GROUP flag on ACEs with a who value
equal to one of the special identifiers outlined in section "ACE
who".
The bitmask constants used for the flag field are as follows:
const ACE4_FILE_INHERIT_ACE = 0x00000001;
const ACE4_DIRECTORY_INHERIT_ACE = 0x00000002;
const ACE4_NO_PROPAGATE_INHERIT_ACE = 0x00000004;
const ACE4_INHERIT_ONLY_ACE = 0x00000008;
const ACE4_SUCCESSFUL_ACCESS_ACE_FLAG = 0x00000010;
const ACE4_FAILED_ACCESS_ACE_FLAG = 0x00000020;
const ACE4_IDENTIFIER_GROUP = 0x00000040;
A server need not support any of these flags. If the server supports
flags that are similar to, but not exactly the same as, these flags,
the implementation may define a mapping between the protocol-defined
flags and the implementation-defined flags. Again, the guiding
principle is that the file not appear to be more secure than it
really is.
For example, suppose a client tries to set an ACE with
ACE4_FILE_INHERIT_ACE set but not ACE4_DIRECTORY_INHERIT_ACE. If the
server does not support any form of ACL inheritance, the server
should reject the request with NFS4ERR_ATTRNOTSUPP. If the server
supports a single "inherit ACE" flag that applies to both files and
directories, the server may reject the request (i.e., requiring the
client to set both the file and directory inheritance flags). The
server may also accept the request and silently turn on the
ACE4_DIRECTORY_INHERIT_ACE flag.
3.16.4. ACE who
There are several special identifiers ("who") which need to be
understood universally, rather than in the context of a particular
DNS domain. Some of these identifiers cannot be understood when an
NFS client accesses the server, but have meaning when a local process
accesses the file. The ability to display and modify these
permissions is permitted over NFS, even if none of the access methods
on the server understands the identifiers.
Who Description
_______________________________________________________________
"OWNER" The owner of the file.
"GROUP" The group associated with the file.
"EVERYONE" The world, including the owner and
owning group.
"INTERACTIVE" Accessed from an interactive terminal.
"NETWORK" Accessed via the network.
"DIALUP" Accessed as a dialup user to the server.
"BATCH" Accessed from a batch job.
"ANONYMOUS" Accessed without any authentication.
"AUTHENTICATED" Any authenticated user (opposite of
ANONYMOUS)
"SERVICE" Access from a system service.
To avoid conflict, these special identifiers are distinguished by an
appended "@" and should appear in the form "xxxx@" (note: no domain
name after the "@"). For example: ANONYMOUS@.
3.16.4.1. Discussion on EVERYONE@
It is important to note that "EVERYONE@" is not equivalent to the
UNIX "other" entity. This is because, by definition, UNIX "other"
does not include the owner or owning group of a file. "EVERYONE@"
means literally everyone, including the owner or owning group.
3.16.4.2. Discussion on OWNER@ and GROUP@
Due to the use of the special identifiers "OWNER@" and "GROUP@" to
indicate that an ACE applies to the owner and owning group,
respectively, associated with a file, the ACL cannot be used to
determine the owner and owning group of a file. This information
should be indicated by the values of the owner and owner_group file
attributes returned by the server.
3.16.5. Mode Attribute
The NFS version 4 mode attribute is based on the UNIX mode bits. The
following bits are defined:
const MODE4_SUID = 0x800; /* set user id on execution */
const MODE4_SGID = 0x400; /* set group id on execution */
const MODE4_SVTX = 0x200; /* save text even after use */
const MODE4_RUSR = 0x100; /* read permission: owner */
const MODE4_WUSR = 0x080; /* write permission: owner */
const MODE4_XUSR = 0x040; /* execute permission: owner */
const MODE4_RGRP = 0x020; /* read permission: group */
const MODE4_WGRP = 0x010; /* write permission: group */
const MODE4_XGRP = 0x008; /* execute permission: group */
const MODE4_ROTH = 0x004; /* read permission: other */
const MODE4_WOTH = 0x002; /* write permission: other */
const MODE4_XOTH = 0x001; /* execute permission: other */
Bits MODE4_RUSR, MODE4_WUSR, and MODE4_XUSR apply to the principal
identified in the owner attribute. Bits MODE4_RGRP, MODE4_WGRP, and
MODE4_XGRP apply to the principals identified in the owner_group
attribute. Bits MODE4_ROTH, MODE4_WOTH, and MODE4_XOTH apply to any
principal that does not match the owner attribute and is not a member
of a group matching the owner_group attribute.
The remaining bits are not defined by this protocol and MUST NOT be
used. The minor version mechanism must be used to define further bit
usage.
Note that in UNIX, if a file has the MODE4_SGID bit set and no
MODE4_XGRP bit set, then READ and WRITE must use mandatory file
locking.
3.16.6. Interaction Between Mode and ACL Attributes
As defined, there is a certain amount of overlap between ACL and mode
file attributes. Even though there is overlap, ACLs don't contain
all the information specified by a mode and modes can't possibly
contain all the information specified by an ACL.
For servers that support both mode and ACL, the mode's MODE4_R*,
MODE4_W* and MODE4_X* values should be computed from the ACL and
should be recomputed upon each SETATTR of ACL. Similarly, upon
SETATTR of mode, the ACL should be modified in order to allow the
mode computed from the ACL to be the same as the mode given to
SETATTR. The mode computed from any given ACL should be
deterministic. This means that given an ACL, the same mode will
always be computed.
For servers that support ACL and not mode, clients may handle
applications which set and get the mode by creating the correct ACL
to send to the server and by computing the mode from the ACL,
respectively. In this case, the methods used by the server to keep
the mode in sync with the ACL can also be used by the client. These
methods are explained in Section 3.16.6.1, Section 3.16.6.2, and
Section 3.16.6.3.
Since the mode can't possibly represent all of the information that
is defined by an ACL, there are some discrepancies to be aware of.
As explained in the section "Deficiencies in a Mode Representation of
an ACL", the mode bits computed from the ACL could potentially convey
more restrictive permissions than what would be granted via the ACL.
Because of this, clients are discouraged from doing their own access
checks based on the mode of a file.
Because the mode attribute includes bits (i.e. MODE4_SUID,
MODE4_SGID, MODE4_SVTX) that have nothing to do with ACL semantics,
it is permitted for clients to specify both the ACL attribute and
mode in the same SETATTR operation. However, because there is no
prescribed order for processing the attributes in a SETATTR, clients
may see differing results. For recommendations on how to achieve
consistent behavior, see Section 3.16.6.4.
3.16.6.1. Recomputing mode upon SETATTR of ACL
Keeping the mode and ACL attributes synchronized is important, but as
mentioned previously, the mode cannot possibly represent all of the
information in the ACL. Still, the mode should be modified to
represent the access as accurately as possible.
The general algorithm to assign a new mode attribute to an object
based on a new ACL being set is:
1. Walk through the ACEs in order, looking for ACEs with a "who"
value of OWNER@, GROUP@, or EVERYONE@.
2. It is understood that ACEs with a "who" value of OWNER@ affect
the *USR bits of the mode, GROUP@ affect *GRP bits, and EVERYONE@
affect *USR, *GRP, and *OTH bits.
3. If such an ACE specifies ALLOW or DENY for ACE4_READ_DATA,
ACE4_WRITE_DATA, or ACE4_EXECUTE, and the mode bits affected have
not been determined yet, set them to one (if ALLOW) or zero (if
DENY).
4. Upon completion, any mode bits as yet undetermined have a value
of zero.
This pseudocode more precisely describes the algorithm:
/* octal constants for the mode bits */
SUID = 04000
SGID = 02000
SVTX = 01000
RUSR = 0400
WUSR = 0200
XUSR = 0100
RGRP = 0040
WGRP = 0020
XGRP = 0010
ROTH = 0004
WOTH = 0002
XOTH = 0001
/*
* old_mode represents the previous value
* of the mode of the object.
*/
mode_t mode = 0, seen = 0;
for each ACE a {
if a.type is ALLOW or DENY and
ACE4_INHERIT_ONLY_ACE is not set in a.flags {
if a.who is OWNER@ {
if ((a.mask & ACE4_READ_DATA) &&
(! (seen & RUSR))) {
seen |= RUSR;
if a.type is ALLOW {
mode |= RUSR;
}
}
if ((a.mask & ACE4_WRITE_DATA) &&
(! (seen & WUSR))) {
seen |= WUSR;
if a.type is ALLOW {
mode |= WUSR;
}
}
if ((a.mask & ACE4_EXECUTE) &&
(! (seen & XUSR))) {
seen |= XUSR;
if a.type is ALLOW {
mode |= XUSR;
}
}
} else if a.who is GROUP@ {
if ((a.mask & ACE4_READ_DATA) &&
(! (seen & RGRP))) {
seen |= RGRP;
if a.type is ALLOW {
mode |= RGRP;
}
}
if ((a.mask & ACE4_WRITE_DATA) &&
(! (seen & WGRP))) {
seen |= WGRP;
if a.type is ALLOW {
mode |= WGRP;
}
}
if ((a.mask & ACE4_EXECUTE) &&
(! (seen & XGRP))) {
seen |= XGRP;
if a.type is ALLOW {
mode |= XGRP;
}
}
} else if a.who is EVERYONE@ {
if (a.mask & ACE4_READ_DATA) {
if ! (seen & RUSR) {
seen |= RUSR;
if a.type is ALLOW {
mode |= RUSR;
}
}
if ! (seen & RGRP) {
seen |= RGRP;
if a.type is ALLOW {
mode |= RGRP;
}
}
if ! (seen & ROTH) {
seen |= ROTH;
if a.type is ALLOW {
mode |= ROTH;
}
}
}
if (a.mask & ACE4_WRITE_DATA) {
if ! (seen & WUSR) {
seen |= WUSR;
if a.type is ALLOW {
mode |= WUSR;
}
}
if ! (seen & WGRP) {
seen |= WGRP;
if a.type is ALLOW {
mode |= WGRP;
}
}
if ! (seen & WOTH) {
seen |= WOTH;
if a.type is ALLOW {
mode |= WOTH;
}
}
}
if (a.mask & ACE4_EXECUTE) {
if ! (seen & XUSR) {
seen |= XUSR;
if a.type is ALLOW {
mode |= XUSR;
}
}
if ! (seen & XGRP) {
seen |= XGRP;
if a.type is ALLOW {
mode |= XGRP;
}
}
if ! (seen & XOTH) {
seen |= XOTH;
if a.type is ALLOW {
mode |= XOTH;
}
}
}
}
}
}
return mode | (old_mode & (SUID | SGID | SVTX))
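A compact, non-normative translation of the pseudocode above is
sketched below. The tuple representation of an ACE and the
table-driven bit mapping are assumptions of the sketch, not part of
the protocol.

```python
# Non-normative translation of the mode-recomputation pseudocode.
# An ACE is modeled as a (type, flags, mask, who) tuple.
ALLOW, DENY = 0x00000000, 0x00000001
ACE4_READ_DATA, ACE4_WRITE_DATA, ACE4_EXECUTE = 0x01, 0x02, 0x20
ACE4_INHERIT_ONLY_ACE = 0x08
SUID, SGID, SVTX = 0o4000, 0o2000, 0o1000

# (acemask4 bit, per-class octal mode bit) pairs: R/W/X map to 4/2/1
_BITS = [(ACE4_READ_DATA, 0o4), (ACE4_WRITE_DATA, 0o2),
         (ACE4_EXECUTE, 0o1)]
# shift for each special "who": OWNER@ -> USR, GROUP@ -> GRP,
# EVERYONE@ -> USR, GRP, and OTH
_SHIFT = {"OWNER@": (6,), "GROUP@": (3,), "EVERYONE@": (6, 3, 0)}

def mode_from_acl(acl, old_mode):
    """Recompute the mode from the ACL; first mention of a bit wins."""
    mode = seen = 0
    for ace_type, flags, mask, who in acl:
        if ace_type not in (ALLOW, DENY) or flags & ACE4_INHERIT_ONLY_ACE:
            continue                      # AUDIT/ALARM and inherit-only skip
        for ace_bit, mode_bit in _BITS:
            if not mask & ace_bit:
                continue
            for shift in _SHIFT.get(who, ()):
                bit = mode_bit << shift
                if not seen & bit:        # only the first ACE decides a bit
                    seen |= bit
                    if ace_type == ALLOW:
                        mode |= bit
    # undetermined bits stay zero; SUID/SGID/SVTX carry over unchanged
    return mode | (old_mode & (SUID | SGID | SVTX))
```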
3.16.6.2. Applying the mode given to CREATE or OPEN to an inherited ACL
The goal of implementing ACL inheritance is for newly created objects
to inherit the ACLs they were intended to inherit, but without
disregarding the mode that is given with the arguments to the CREATE
or OPEN operations. The general algorithm is as follows:
1. Form an ACL on the newly created object that is the concatenation
of all inheritable ACEs from its parent directory. Note that
there may be zero inheritable ACEs; thus, an object may start
with an empty ACL.
2. For each ACE in the new ACL, adjust its flags if necessary, and
possibly create two ACEs in place of one. This is necessary to
honor the intent of the inheritance-related flags and to
preserve information about the original inheritable ACEs in the
case that they will be modified by other steps. The algorithm is
as follows:
A. If the ACE4_NO_PROPAGATE_INHERIT_ACE is set, or if the object
being created is not a directory, then clear the following
flags:
ACE4_NO_PROPAGATE_INHERIT_ACE
ACE4_FILE_INHERIT_ACE
ACE4_DIRECTORY_INHERIT_ACE
ACE4_INHERIT_ONLY_ACE
Continue on to the next ACE.
B. If the object being created is a directory and
ACE4_FILE_INHERIT_ACE is set, but ACE4_DIRECTORY_INHERIT_ACE
is NOT set, then we ensure that ACE4_INHERIT_ONLY_ACE is set.
Continue on to the next ACE. Otherwise:
C. If the type of the ACE is neither ALLOW nor DENY, then
continue on to the next ACE.
D. Copy the original ACE into a second, adjacent ACE.
E. On the first ACE, ensure that ACE4_INHERIT_ONLY_ACE is set.
F. On the second ACE, clear the following flags:
ACE4_NO_PROPAGATE_INHERIT_ACE
ACE4_FILE_INHERIT_ACE
ACE4_DIRECTORY_INHERIT_ACE
ACE4_INHERIT_ONLY_ACE
G. On the second ACE, if the type field is ALLOW, an
implementation MAY clear the following mask bits:
ACE4_WRITE_ACL
ACE4_WRITE_OWNER
3. To ensure that the mode is honored, apply the algorithm for
applying a mode to a file/directory with an existing ACL on the
new object as described in Section 3.16.6.3, using the mode that
is to be used for file creation.
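Step 2 above can be sketched non-normatively as follows. The tuple
representation of an ACE and the helper name are assumptions of the
sketch; the optional step 2.G clearing of ACE4_WRITE_ACL and
ACE4_WRITE_OWNER is omitted.

```python
# Non-normative sketch of step 2: adjust an inherited ACE's flags,
# possibly splitting it into an inherit-only copy plus an effective
# copy.  An ACE is modeled as a (type, flags, mask, who) tuple; the
# flag values match the aceflag4 constants in this document.
ALLOW, DENY = 0x00000000, 0x00000001
FILE_INHERIT, DIR_INHERIT = 0x01, 0x02
NO_PROPAGATE, INHERIT_ONLY = 0x04, 0x08
INHERIT_FLAGS = FILE_INHERIT | DIR_INHERIT | NO_PROPAGATE | INHERIT_ONLY

def adjust_inherited_ace(ace, is_directory):
    """Return the ACE(s) replacing one inherited ACE on a new object."""
    t, flags, mask, who = ace
    if (flags & NO_PROPAGATE) or not is_directory:
        # step 2.A: strip all inheritance flags and stop
        return [(t, flags & ~INHERIT_FLAGS, mask, who)]
    if (flags & FILE_INHERIT) and not (flags & DIR_INHERIT):
        # step 2.B: applies only to files, so inherit-only on a directory
        return [(t, flags | INHERIT_ONLY, mask, who)]
    if t not in (ALLOW, DENY):
        return [ace]                      # step 2.C: leave as-is
    first = (t, flags | INHERIT_ONLY, mask, who)       # steps 2.D-2.E
    second = (t, flags & ~INHERIT_FLAGS, mask, who)    # step 2.F
    return [first, second]
```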
3.16.6.3. Applying a Mode to an Existing ACL
An existing ACL can mean two things in this context. One, that a
file/directory already exists and it has an ACL. Two, that a
directory has inheritable ACEs that will make up the ACL for any new
files or directories created therein.
The high-level goal of the behavior when a mode is set on a file with
an existing ACL is to take the new mode into account, without needing
to delete a pre-existing ACL.
When a mode is applied to an object, e.g. via SETATTR or CREATE/OPEN,
the ACL must be modified to accommodate the mode.
1. The ACL is traversed, one ACE at a time. For each ACE:
1. If the type of the ACE is neither ALLOW nor DENY, the ACE is
left unchanged. Continue to the next ACE.
2. If the ACE4_INHERIT_ONLY_ACE flag is set on the ACE, it is
left unchanged. Continue to the next ACE.
3. If either or both of ACE4_FILE_INHERIT_ACE or
ACE4_DIRECTORY_INHERIT_ACE are set:
1. A copy of the ACE is made, and placed in the ACL
immediately following the current ACE.
2. In the first ACE, the flag ACE4_INHERIT_ONLY_ACE is set.
3. In the second ACE, the following flags are cleared:
ACE4_FILE_INHERIT_ACE
ACE4_DIRECTORY_INHERIT_ACE
ACE4_NO_PROPAGATE_INHERIT_ACE
The algorithm continues on with the second ACE.
4. If the "who" field is one of the following:
OWNER@
GROUP@
EVERYONE@
then the following mask bits are cleared:
ACE4_READ_DATA / ACE4_LIST_DIRECTORY
ACE4_WRITE_DATA / ACE4_ADD_FILE
ACE4_APPEND_DATA / ACE4_ADD_SUBDIRECTORY
ACE4_EXECUTE
At this point, we proceed to the next ACE.
5. Otherwise, if the "who" field did not match one of OWNER@,
GROUP@, or EVERYONE@, the following steps SHOULD be
performed.
1. If the type of the ACE is ALLOW, we check the preceding
ACE (if any). If it does not meet all of the following
criteria:
1. The type field is DENY.
2. The who field is the same as the current ACE.
3. The flag bit ACE4_IDENTIFIER_GROUP is the same as it
is in the current ACE, and no other flag bits are
set.
4. The mask bits are a subset of the mask bits of the
current ACE, and are also a subset of the following:
ACE4_READ_DATA / ACE4_LIST_DIRECTORY
ACE4_WRITE_DATA / ACE4_ADD_FILE
ACE4_APPEND_DATA / ACE4_ADD_SUBDIRECTORY
ACE4_EXECUTE
then an ACE of type DENY, with a who equal to the current
ACE, flag bits equal to (<current-ACE-flags> &
ACE4_IDENTIFIER_GROUP), and no mask bits, is prepended.
2. The following modifications are made to the prepended
ACE. The intent is to mask the following ACE to disallow
ACE4_READ_DATA, ACE4_WRITE_DATA, ACE4_APPEND_DATA, or
ACE4_EXECUTE, based upon the group permissions of the new
mode. As a special case, if the ACE matches the current
owner of the file, the owner bits are used, rather than
the group bits. This is reflected in the algorithm
below.
Let there be three bits defined:
#define READ 04
#define WRITE 02
#define EXEC 01
Let "amode" be the new mode, right-shifted three
bits, in order to have the group permission bits
placed in the three low order bits of amode,
i.e. amode = mode >> 3
If ACE4_IDENTIFIER_GROUP is not set in the flags,
and the "who" field of the ACE matches the owner
of the file, we shift amode three more bits, in
order to have the owner permission bits placed in
the three low order bits of amode:
amode = amode >> 3
amode is now used as follows:
If ACE4_READ_DATA is set on the current ACE:
If READ is set on amode:
ACE4_READ_DATA is cleared on the prepended ACE
else:
ACE4_READ_DATA is set on the prepended ACE
If ACE4_WRITE_DATA is set on the current ACE:
If WRITE is set on amode:
ACE4_WRITE_DATA is cleared on the prepended ACE
else:
ACE4_WRITE_DATA is set on the prepended ACE
If ACE4_APPEND_DATA is set on the current ACE:
If WRITE is set on amode:
ACE4_APPEND_DATA is cleared on the
prepended ACE
else:
ACE4_APPEND_DATA is set on the prepended ACE
If ACE4_EXECUTE is set on the current ACE:
If EXEC is set on amode:
ACE4_EXECUTE is cleared on the prepended ACE
else:
ACE4_EXECUTE is set on the prepended ACE
3. To conform with POSIX, and prevent cases where the owner
of the file is given permissions via an explicit group,
we implement the following step.
If ACE4_IDENTIFIER_GROUP is set in the flags field of
the ALLOW ACE:
Let "mode" be the mode that we are chmoding to:
extramode = (mode >> 3) & 07
ownermode = mode >> 6
extramode &= ~ownermode
If extramode is not zero:
If extramode & READ:
Clear ACE4_READ_DATA in both the
prepended DENY ACE and the ALLOW ACE
If extramode & WRITE:
Clear ACE4_WRITE_DATA and ACE4_APPEND_DATA
in both the prepended DENY ACE and the
ALLOW ACE
If extramode & EXEC:
Clear ACE4_EXECUTE in both the prepended
DENY ACE and the ALLOW ACE
2. If there are at least six ACEs, the final six ACEs are examined.
If they are not equal to the following ACEs:
A1) OWNER@:::DENY
A2) OWNER@:ACE4_WRITE_ACL/ACE4_WRITE_OWNER/
ACE4_WRITE_ATTRIBUTES/ACE4_WRITE_NAMED_ATTRIBUTES::ALLOW
A3) GROUP@::ACE4_IDENTIFIER_GROUP:DENY
A4) GROUP@::ACE4_IDENTIFIER_GROUP:ALLOW
A5) EVERYONE@:ACE4_WRITE_ACL/ACE4_WRITE_OWNER/
ACE4_WRITE_ATTRIBUTES/ACE4_WRITE_NAMED_ATTRIBUTES::DENY
A6) EVERYONE@:ACE4_READ_ACL/ACE4_READ_ATTRIBUTES/
ACE4_READ_NAMED_ATTRIBUTES/ACE4_SYNCHRONIZE::ALLOW
Then six ACEs matching the above are appended.
3. The final six ACEs are adjusted according to the incoming mode.
/* octal constants for the mode bits */
RUSR = 0400
WUSR = 0200
XUSR = 0100
RGRP = 0040
WGRP = 0020
XGRP = 0010
ROTH = 0004
WOTH = 0002
XOTH = 0001
If RUSR is set: set ACE4_READ_DATA in A2
else: set ACE4_READ_DATA in A1
If WUSR is set: set ACE4_WRITE_DATA and ACE4_APPEND_DATA in A2
else: set ACE4_WRITE_DATA and ACE4_APPEND_DATA in A1
If XUSR is set: set ACE4_EXECUTE in A2
else: set ACE4_EXECUTE in A1
If RGRP is set: set ACE4_READ_DATA in A4
else: set ACE4_READ_DATA in A3
If WGRP is set: set ACE4_WRITE_DATA and ACE4_APPEND_DATA in A4
else: set ACE4_WRITE_DATA and ACE4_APPEND_DATA in A3
If XGRP is set: set ACE4_EXECUTE in A4
else: set ACE4_EXECUTE in A3
If ROTH is set: set ACE4_READ_DATA in A6
else: set ACE4_READ_DATA in A5
If WOTH is set: set ACE4_WRITE_DATA and ACE4_APPEND_DATA in A6
else: set ACE4_WRITE_DATA and ACE4_APPEND_DATA in A5
If XOTH is set: set ACE4_EXECUTE in A6
else: set ACE4_EXECUTE in A5
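The adjustment of the final six ACEs can be sketched as follows (non-normative; each of A1..A6 is modeled as an integer mask to be OR-ed into the corresponding ACE, with the DENY/ALLOW pairing of the list above):

```python
ACE4_READ_DATA   = 0x00000001
ACE4_WRITE_DATA  = 0x00000002
ACE4_APPEND_DATA = 0x00000004
ACE4_EXECUTE     = 0x00000020

def adjust_final_six(mode):
    """Return masks to OR into A1..A6 (list indices 0..5).

    Even indices are the DENY ACEs (A1, A3, A5); odd indices are the
    ALLOW ACEs (A2, A4, A6), for owner, group, and other in turn.
    """
    masks = [0] * 6
    for i, shift in enumerate((6, 3, 0)):        # owner, group, other
        bits = (mode >> shift) & 0o7
        deny, allow = 2 * i, 2 * i + 1
        for probe, ace in ((0o4, ACE4_READ_DATA),
                           (0o2, ACE4_WRITE_DATA | ACE4_APPEND_DATA),
                           (0o1, ACE4_EXECUTE)):
            if bits & probe:
                masks[allow] |= ace
            else:
                masks[deny] |= ace
    return masks
```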
3.16.6.4. ACL and mode in the same SETATTR
The only reason that a mode and ACL should be set in the same SETATTR
is if the user wants to set the SUID, SGID and SVTX bits along with
setting the permissions by means of an ACL. There is still no way to
enforce which order the attributes will be set in, and it is likely
that different orders of operations will produce different results.
3.16.6.4.1. Client Side Recommendations
If an application needs to enforce a certain behavior, it is
recommended that the client implementations set mode and ACL in
separate SETATTR requests. This will produce consistent and expected
results.
If an application wants to set SUID, SGID and SVTX bits and an ACL:
In the first SETATTR, set the mode with SUID, SGID and SVTX bits
as desired and all other bits with a value of 0.
In a following SETATTR (preferably in the same COMPOUND) set the
ACL.
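The mode for the first SETATTR of the recommended split reduces to a one-line mask (non-normative sketch; the constants are the standard POSIX mode bits):

```python
SUID, SGID, SVTX = 0o4000, 0o2000, 0o1000

def first_setattr_mode(mode):
    """Mode to send in the first SETATTR when an ACL follows in a
    second SETATTR: keep only the SUID/SGID/SVTX bits and zero the
    permission bits, which the ACL will then define."""
    return mode & (SUID | SGID | SVTX)
```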
3.16.6.4.2. Server Side Recommendations
If both mode and ACL are given to SETATTR, server implementations
should verify that the mode and ACL don't conflict, i.e. the mode
computed from the given ACL must be the same as the given mode,
excluding the SUID, SGID and SVTX bits. The algorithm for assigning
a new mode based on the ACL can be used. This is described in
Section 3.16.6.1. If a server receives a request to set both
mode and ACL, but the two conflict, the server should return
NFS4ERR_INVAL.
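The server-side consistency check can be sketched as follows (non-normative; the mode computed from the ACL is assumed to come from the Section 3.16.6.1 algorithm and is passed in here rather than recomputed):

```python
def setattr_mode_acl_conflict(given_mode, mode_from_acl):
    """True if a SETATTR carrying both mode and ACL must fail with
    NFS4ERR_INVAL: the mode computed from the ACL must equal the
    given mode, ignoring the SUID, SGID and SVTX bits."""
    return (given_mode & 0o777) != (mode_from_acl & 0o777)
```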
3.16.6.5. Inheritance and turning it off
The inheritance of access permissions may be problematic if a user
cannot prevent their file from inheriting unwanted permissions. For
example, a user, "samf", sets up a shared project directory to be
used by everyone working on Project Foo. "lisagab" is a part of
Project Foo, but is working on something that should not be seen by
anyone else. How can "lisagab" make sure that any new files that she
creates in this shared project directory do not inherit anything that
could compromise the security of her work?
More relevant to the implementors of NFS version 4 clients and
servers is the question of how to communicate the fact that user
"lisagab" doesn't want any permissions to be inherited by her newly
created file or directory.
To do this, implementors should standardize on what the behavior of
CREATE and OPEN must be if:
1. just mode is given
In this case, inheritance will take place, but the mode will be
applied to the inherited ACL as described in Section 3.16.6.1,
thereby modifying the ACL.
2. just ACL is given
In this case, inheritance will not take place, and the ACL as
defined in the CREATE or OPEN will be set without modification.
3. both mode and ACL are given
In this case, implementors should verify that the mode and ACL
don't conflict, i.e. the mode computed from the given ACL must be
the same as the given mode. The algorithm for assigning a new
mode based on the ACL can be used. This is described in
Section 3.16.6.1. If a server receives a request to set both mode
and ACL, but the two conflict, the server should return
NFS4ERR_INVAL. If the mode and ACL don't conflict, inheritance
will not take place and both the mode and ACL will be set
without modification.
4. neither mode nor ACL is given
In this case, inheritance will take place and no modifications to
the ACL will happen. It is worth noting that if no inheritable
ACEs exist on the parent directory, the file will be created with
an empty ACL, thus granting no accesses.
3.16.6.6. Deficiencies in a Mode Representation of an ACL
In the presence of an ACL, there are certain cases when the
representation of the mode is not guaranteed to be accurate. An
example of a situation is detailed below.
As mentioned in Section 3.16.6, the representation of the mode is
deterministic, but not guaranteed to be accurate. The mode bits
potentially convey a more restrictive permission than what will
actually be granted via the ACL.
Given the following ACL of two ACEs:
GROUP@:ACE4_READ_DATA/ACE4_WRITE_DATA/ACE4_EXECUTE:
ACE4_IDENTIFIER_GROUP:ALLOW
EVERYONE@:ACE4_READ_DATA/ACE4_WRITE_DATA/ACE4_EXECUTE::DENY
we would compute a mode of 0070. However, it is possible, even
likely, that the owner might be a member of the object's owning
group, and thus, the owner would be granted read, write, and execute
access to the object. This would conflict with the mode of 0070,
where an owner would be denied this access.
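Under a simplified first-match-wins mapping from ACL to mode (a non-normative sketch, not the normative Section 3.16.6.1 algorithm), the two ACEs above do yield 0070:

```python
ACE4_READ_DATA, ACE4_WRITE_DATA, ACE4_EXECUTE = 0x1, 0x2, 0x20

def mode_from_simple_acl(aces):
    """aces: list of (who, mask, is_allow) in evaluation order.
    Only OWNER@/GROUP@/EVERYONE@ are handled, and the first ACE
    mentioning a given access for a given class decides that bit.
    This is an illustrative simplification only."""
    mode = 0
    decided = {"owner": 0, "group": 0, "other": 0}
    shifts = {"owner": 6, "group": 3, "other": 0}
    targets_of = {"OWNER@": ["owner"], "GROUP@": ["group"],
                  "EVERYONE@": ["owner", "group", "other"]}
    for who, mask, is_allow in aces:
        for ace_bit, mode_bit in ((ACE4_READ_DATA, 0o4),
                                  (ACE4_WRITE_DATA, 0o2),
                                  (ACE4_EXECUTE, 0o1)):
            if not (mask & ace_bit):
                continue
            for t in targets_of[who]:
                if decided[t] & mode_bit:
                    continue           # an earlier ACE already decided it
                decided[t] |= mode_bit
                if is_allow:
                    mode |= mode_bit << shifts[t]
    return mode
```

Here GROUP@ grants rwx to the group class, and EVERYONE@ then denies the still-undecided owner and other classes, so the computed mode is 0070 regardless of the owner's actual group membership.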
The only way to overcome this deficiency would be to determine
whether the object's owner is a member of the object's owning group.
This is difficult, but worse, on a POSIX or any UNIX-like system, it
is a process' membership in a group that is important, not a user's.
Thus, any fixed mode intended to represent the above ACL can be
incorrect.
Example: administrative databases (possibly /etc/passwd and /etc/
group) indicate that the user "bob" is a member of the group "staff".
An object has the ACL given above, is owned by "bob", and has an
owning group of "staff". User "bob" has logged into the system, and
thus processes have been created owned by "bob" and having membership
in group "staff".
A mode representation of the above ACL could thus be 0770, due to
user "bob" having membership in group "staff". Now, the
administrative databases are changed, such that user "bob" is no
longer in group "staff". User "bob" logs in to the system again, and
thus more processes are created, this time owned by "bob" but NOT in
group "staff".
A mode of 0770 is inaccurate for processes not belonging to group
"staff". But even if the mode of the file were proactively changed
to 0070 at the time the group database was edited, mode 0070 would be
inaccurate for the pre-existing processes owned by user "bob" and
having membership in group "staff".
4. Filesystem Migration and Replication
With the use of the recommended attribute "fs_locations", the NFS
version 4 server has a method of providing filesystem migration or
replication services. For the purposes of migration and replication,
a filesystem will be defined as all files that share a given fsid
(both major and minor values are the same).
The fs_locations attribute provides a list of filesystem locations.
These locations are specified by providing the server name (either
DNS domain or IP address) and the path name representing the root of
the filesystem. Depending on the type of service being provided, the
list will provide a new location or a set of alternate locations for
the filesystem. The client will use this information to redirect its
requests to the new server.
4.1. Replication
It is expected that filesystem replication will be used in the case
of read-only data. Typically, the filesystem will be replicated on
two or more servers. The fs_locations attribute will provide the
list of these locations to the client. On first access of the
filesystem, the client should obtain the value of the fs_locations
attribute. If, in the future, the client finds the server
unresponsive, the client may attempt to use another server specified
by fs_locations.
If applicable, the client must take the appropriate steps to recover
valid filehandles from the new server. This is described in more
detail in the following sections.
4.2. Migration
Filesystem migration is used to move a filesystem from one server to
another. Migration is typically used for a filesystem that is
writable and has a single copy. The expected use of migration is for
load balancing or general resource reallocation. The protocol does
not specify how the filesystem will be moved between servers. This
server-to-server transfer mechanism is left to the server
implementor. However, the method used to communicate the migration
event between client and server is specified here.
Once the servers participating in the migration have completed the
move of the filesystem, the error NFS4ERR_MOVED will be returned for
subsequent requests received by the original server. The
NFS4ERR_MOVED error is returned for all operations except PUTFH and
GETATTR. Upon receiving the NFS4ERR_MOVED error, the client will
obtain the value of the fs_locations attribute. The client will then
use the contents of the attribute to redirect its requests to the
specified server. To facilitate the use of GETATTR, operations such
as PUTFH must also be accepted by the server for the migrated file
system's filehandles. Note that if the server returns NFS4ERR_MOVED,
the server MUST support the fs_locations attribute.
If the client requests more attributes than just fs_locations, the
server may return fs_locations only. This is to be expected since
the server has migrated the filesystem and may not have a method of
obtaining additional attribute data.
The server implementor needs to be careful in developing a migration
solution. The server must consider all of the state information
clients may have outstanding at the server. This includes but is not
limited to locking/share state, delegation state, and asynchronous
file writes which are represented by WRITE and COMMIT verifiers. The
server should strive to minimize the impact on its clients during and
after the migration process.
4.3. Interpretation of the fs_locations Attribute
The fs_locations attribute is structured in the following way:
struct fs_location {
utf8str_cis server<>;
pathname4 rootpath;
};
struct fs_locations {
pathname4 fs_root;
fs_location locations<>;
};
The fs_location struct is used to represent the location of a
filesystem by providing a server name and the path to the root of the
filesystem. For a multi-homed server or a set of servers that use
the same rootpath, an array of server names may be provided. An
entry in the server array is a UTF-8 string and represents one of a
traditional DNS host name, IPv4 address, or IPv6 address. It is not
a requirement that all servers that share the same rootpath be listed
in one fs_location struct. The array of server names is provided for
convenience. Servers that share the same rootpath may also be listed
in separate fs_location entries in the fs_locations attribute.
The fs_locations struct and attribute then contains an array of
locations. Since the name space of each server may be constructed
differently, the "fs_root" field is provided. The path represented
by fs_root represents the location of the filesystem in the server's
name space. Therefore, the fs_root path is only associated with the
server from which the fs_locations attribute was obtained. The
fs_root path is meant to aid the client in locating the filesystem at
the various servers listed.
As an example, there is a replicated filesystem located at two
servers (servA and servB). At servA the filesystem is located at
path "/a/b/c". At servB the filesystem is located at path "/x/y/z".
In this example the client accesses the filesystem first at servA
with a multi-component lookup path of "/a/b/c/d". Since the client
used a multi-component lookup to obtain the filehandle at "/a/b/c/d",
it is unaware that the filesystem's root is located in servA's name
space at "/a/b/c". When the client switches to servB, it will need
to determine that the directory it first referenced at servA is now
represented by the path "/x/y/z/d" on servB. To facilitate this, the
fs_locations attribute provided by servA would have a fs_root value
of "/a/b/c" and two entries in fs_location. One entry in fs_location
will be for itself (servA) and the other will be for servB with a
path of "/x/y/z". With this information, the client is able to
substitute "/x/y/z" for the "/a/b/c" at the beginning of its access
path and construct "/x/y/z/d" to use for the new server.
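The client's substitution can be sketched with paths modeled as component lists, matching the pathname4 representation (non-normative):

```python
def translate_path(path, fs_root, new_rootpath):
    """Replace the fs_root prefix of an accessed path with the
    rootpath of an alternate server taken from fs_locations."""
    if path[:len(fs_root)] != fs_root:
        raise ValueError("accessed path must lie under fs_root")
    return new_rootpath + path[len(fs_root):]
```

In the example above, translating ["a", "b", "c", "d"] with an fs_root of ["a", "b", "c"] and servB's rootpath ["x", "y", "z"] yields ["x", "y", "z", "d"].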
See the section "Security Considerations" for a discussion on the
recommendations for the security flavor to be used by any GETATTR
operation that requests the "fs_locations" attribute.
4.4. Filehandle Recovery for Migration or Replication
Filehandles for filesystems that are replicated or migrated generally
have the same semantics as for filesystems that are not replicated or
migrated. For example, if a filesystem has persistent filehandles
and it is migrated to another server, the filehandle values for the
filesystem will be valid at the new server.
For volatile filehandles, the servers involved likely do not have a
mechanism to transfer filehandle format and content between
themselves. Therefore, a server may have difficulty in determining
if a volatile filehandle from an old server should return an error of
NFS4ERR_FHEXPIRED. For this reason, the client is informed, with the use
of the fh_expire_type attribute, whether volatile filehandles will
expire at the migration or replication event. If the bit
FH4_VOL_MIGRATION is set in the fh_expire_type attribute, the client
must treat the volatile filehandle as if the server had returned the
NFS4ERR_FHEXPIRED error. At the migration or replication event in
the presence of the FH4_VOL_MIGRATION bit, the client will not
present the original or old volatile filehandle to the new server.
The client will start its communication with the new server by
recovering its filehandles using the saved file names.
5. NFS Server Name Space
5.1. Server Exports
On a UNIX server the name space describes all the files reachable by
pathnames under the root directory or "/". On a Windows NT server
the name space constitutes all the files on disks named by mapped
disk letters. NFS server administrators rarely make the entire
server's filesystem name space available to NFS clients. More often
portions of the name space are made available via an "export"
feature. In previous versions of the NFS protocol, the root
filehandle for each export is obtained through the MOUNT protocol;
the client sends a string that identifies the export of name space
and the server returns the root filehandle for it. The MOUNT
protocol supports an EXPORTS procedure that will enumerate the
server's exports.
5.2. Browsing Exports
The NFS version 4 protocol provides a root filehandle that clients
can use to obtain filehandles for these exports via a multi-component
LOOKUP. A common user experience is to use a graphical user
interface (perhaps a file "Open" dialog window) to find a file via
progressive browsing through a directory tree. The client must be
able to move from one export to another export via single-component,
progressive LOOKUP operations.
This style of browsing is not well supported by the NFS version 2 and
3 protocols. The client expects all LOOKUP operations to remain
within a single server filesystem. For example, the device attribute
will not change. This prevents a client from taking name space paths
that span exports.
An automounter on the client can obtain a snapshot of the server's
name space using the EXPORTS procedure of the MOUNT protocol. If it
understands the server's pathname syntax, it can create an image of
the server's name space on the client. The parts of the name space
that are not exported by the server are filled in with a "pseudo
filesystem" that allows the user to browse from one mounted
filesystem to another. There is a drawback to this representation of
the server's name space on the client: it is static. If the server
administrator adds a new export, the client will be unaware of it.
5.3. Server Pseudo Filesystem
NFS version 4 servers avoid this name space inconsistency by
presenting all the exports within the framework of a single server
name space. An NFS version 4 client uses LOOKUP and READDIR
operations to browse seamlessly from one export to another. Portions
of the server name space that are not exported are bridged via a
"pseudo filesystem" that provides a view of exported directories
only. A pseudo filesystem has a unique fsid and behaves like a
normal, read-only filesystem.
Based on the construction of the server's name space, it is possible
that multiple pseudo filesystems may exist. For example,
/a pseudo filesystem
/a/b real filesystem
/a/b/c pseudo filesystem
/a/b/c/d real filesystem
Each of the pseudo filesystems is considered a separate entity and
therefore has a unique fsid.
5.4. Multiple Roots
The DOS and Windows operating environments are sometimes described as
having "multiple roots". Filesystems are commonly represented as
disk letters. MacOS represents filesystems as top level names. NFS
version 4 servers for these platforms can construct a pseudo file
system above these root names so that disk letters or volume names
are simply directory names in the pseudo root.
5.5. Filehandle Volatility
The nature of the server's pseudo filesystem is that it is a logical
representation of filesystem(s) available from the server.
Therefore, the pseudo filesystem is most likely constructed
dynamically when the server is first instantiated. It is expected
that the pseudo filesystem may not have an on disk counterpart from
which persistent filehandles could be constructed. Even though it is
preferable that the server provide persistent filehandles for the
pseudo filesystem, the NFS client should expect that pseudo file
system filehandles are volatile. This can be confirmed by checking
the associated "fh_expire_type" attribute for those filehandles in
question. If the filehandles are volatile, the NFS client must be
prepared to recover a filehandle value (e.g. with a multi-component
LOOKUP) when receiving an error of NFS4ERR_FHEXPIRED.
5.6. Exported Root
If the server's root filesystem is exported, one might conclude that
a pseudo-filesystem is not needed. This would be wrong. Assume the
following filesystems on a server:
/ disk1 (exported)
/a disk2 (not exported)
/a/b disk3 (exported)
Because disk2 is not exported, disk3 cannot be reached with simple
LOOKUPs. The server must bridge the gap with a pseudo-filesystem.
5.7. Mount Point Crossing
The server filesystem environment may be constructed in such a way
that one filesystem contains a directory which is 'covered' or
mounted upon by a second filesystem. For example:
/a/b (filesystem 1)
/a/b/c/d (filesystem 2)
The pseudo filesystem for this server may be constructed to look
like:
/ (place holder/not exported)
/a/b (filesystem 1)
/a/b/c/d (filesystem 2)
It is the server's responsibility to present the pseudo filesystem
that is complete to the client. If the client sends a lookup request
for the path "/a/b/c/d", the server's response is the filehandle of
the filesystem "/a/b/c/d". In previous versions of the NFS protocol,
the server would respond with the filehandle of directory "/a/b/c/d"
within the filesystem "/a/b".
The NFS client will be able to determine if it crosses a server mount
point by a change in the value of the "fsid" attribute.
5.8. Security Policy and Name Space Presentation
The application of the server's security policy needs to be carefully
considered by the implementor. One may choose to limit the
viewability of portions of the pseudo filesystem based on the
server's perception of the client's ability to authenticate itself
properly. However, with the support of multiple security mechanisms
and the ability to negotiate the appropriate use of these mechanisms,
the server is unable to properly determine if a client will be able
to authenticate itself. If, based on its policies, the server
chooses to limit the contents of the pseudo filesystem, the server
may effectively hide filesystems from a client that may otherwise
have legitimate access.
As suggested practice, the server should apply the security policy of
a shared resource in the server's namespace to the components of the
resource's ancestors. For example:
/
/a/b
/a/b/c
The /a/b/c directory is a real filesystem and is the shared resource.
The security policy for /a/b/c is Kerberos with integrity. The
server should apply the same security policy to /, /a, and /a/b.
This allows for the extension of the protection of the server's
namespace to the ancestors of the real shared resource.
For the case of the use of multiple, disjoint security mechanisms in
the server's resources, the security for a particular object in the
server's namespace should be the union of all security mechanisms of
all direct descendants.
6. File Locking and Share Reservations
Integrating locking into the NFS protocol necessarily causes it to be
stateful. With the inclusion of share reservations the protocol
becomes substantially more dependent on state than the traditional
combination of NFS and NLM [XNFS]. There are three components to
making this state manageable:
o Clear division between client and server
o Ability to reliably detect inconsistency in state between client
and server
o Simple and robust recovery mechanisms
In this model, the server owns the state information. The client
communicates its view of this state to the server as needed. The
client is also able to detect inconsistent state before modifying a
file.
To support Win32 share reservations it is necessary to atomically
OPEN or CREATE files. Having a separate share/unshare operation
would not allow correct implementation of the Win32 OpenFile API. In
order to correctly implement share semantics, the previous NFS
protocol mechanisms used when a file is opened or created (LOOKUP,
CREATE, ACCESS) need to be replaced. The NFS version 4 protocol has
an OPEN operation that subsumes the NFS version 3 methodology of
LOOKUP, CREATE, and ACCESS. However, because many operations require
a filehandle, the traditional LOOKUP is preserved to map a file name
to filehandle without establishing state on the server. The policy
of granting access or modifying files is managed by the server based
on the client's state. These mechanisms can implement policy ranging
from advisory only locking to full mandatory locking.
6.1. Locking
It is assumed that manipulating a lock is rare when compared to READ
and WRITE operations. It is also assumed that crashes and network
partitions are relatively rare. Therefore it is important that the
READ and WRITE operations have a lightweight mechanism to indicate if
they possess a held lock. A lock request contains the heavyweight
information required to establish a lock and uniquely define the lock
owner.
The following sections describe the transition from the heavy weight
information to the eventual stateid used for most client and server
locking and lease interactions.
6.1.1. Client ID
For each LOCK request, the client must identify itself to the server.
This is done in such a way as to allow for correct lock
identification and crash recovery. A sequence of a SETCLIENTID
operation followed by a SETCLIENTID_CONFIRM operation is required to
establish the identification onto the server. Establishment of
identification by a new incarnation of the client also has the effect
of immediately breaking any leased state that a previous incarnation
of the client might have had on the server, as opposed to forcing the
new client incarnation to wait for the leases to expire. Breaking
the lease state amounts to the server removing all lock, share
reservation, and, where the server is not supporting the
CLAIM_DELEGATE_PREV claim type, all delegation state associated with
same client with the same identity. For discussion of delegation
state recovery, see the section "Delegation Recovery".
Client identification is encapsulated in the following structure:
struct nfs_client_id4 {
verifier4 verifier;
opaque id<NFS4_OPAQUE_LIMIT>;
};
The first field, verifier, is a client incarnation verifier that is
used to detect client reboots. Only if the verifier differs from the
one the server has previously recorded for the client (as identified
by the second field of the structure, id) does the server start the
process of canceling the client's leased state.
The second field, id, is a variable-length string that uniquely
defines the client.
There are several considerations for how the client generates the id
string:
o The string should be unique so that multiple clients do not
present the same string. The consequences of two clients
presenting the same string range from one client getting an error
to one client having its leased state abruptly and unexpectedly
canceled.
o The string should be selected so that subsequent incarnations (e.g.
reboots) of the same client cause the client to present the same
string. The implementor is cautioned against an approach that
requires the string to be recorded in a local file because this
precludes the use of the implementation in an environment where
there is no local disk and all file access is from an NFS version
4 server.
o The string should be different for each server network address
that the client accesses, rather than common to all server network
addresses. The reason is that it may not be possible for the
client to tell if the same server is listening on multiple network
addresses. If the client issues SETCLIENTID with the same id
string to each network address of such a server, the server will
think it is the same client, and each successive SETCLIENTID will
cause the server to begin the process of removing the client's
previous leased state.
o The algorithm for generating the string should not assume that the
client's network address won't change. This includes changes
between client incarnations and even changes while the client is
still running in its current incarnation. This means that if
the client includes just the client's and server's network address
in the id string, there is a real risk, after the client gives up
the network address, that another client, using a similar
algorithm for generating the id string, will generate a
conflicting id string.
o Given the above considerations, an example of a well-generated id
string is one that includes:
o The server's network address.
o The client's network address.
o For a user level NFS version 4 client, it should contain
additional information to distinguish the client from other user
level clients running on the same host, such as a process id or
other unique sequence.
o Additional information that tends to be unique, such as one or
more of:
* The client machine's serial number (for privacy reasons, it is
best to perform some one way function on the serial number).
* A MAC address.
* The timestamp of when the NFS version 4 software was first
installed on the client (though this is subject to the
previously mentioned caution about using information that is
stored in a file, because the file might only be accessible
over NFS version 4).
* A true random number. However, since this number ought to be
the same between client incarnations, this shares the same
problem as using the timestamp of the software
installation.
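A client might assemble its id string along these lines (non-normative sketch; the use of SHA-256 as the one-way function over the serial number is an arbitrary illustrative choice, and a user-level client would further append a process id or similar unique value as noted above):

```python
import hashlib

def make_client_id_string(server_addr, client_addr, serial_number):
    """Build a SETCLIENTID id string that is stable across reboots
    of this client, differs per server network address, and passes
    the machine serial number through a one-way function so it is
    not exposed in the clear."""
    digest = hashlib.sha256(serial_number.encode()).hexdigest()
    return "%s/%s/%s" % (client_addr, server_addr, digest)
```

The same inputs always yield the same string (surviving client reboots), while a different server address yields a different string, as the considerations above require.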
As a security measure, the server MUST NOT cancel a client's leased
state if the principal that established the state for a given id
string is not the same as the principal issuing the SETCLIENTID.
Note that SETCLIENTID and SETCLIENTID_CONFIRM have a secondary purpose
of establishing the information the server needs to make callbacks to
the client for the purpose of supporting delegations. It is permitted to
change this information via SETCLIENTID and SETCLIENTID_CONFIRM
within the same incarnation of the client without removing the
client's leased state.
Once a SETCLIENTID and SETCLIENTID_CONFIRM sequence has successfully
completed, the client uses the shorthand client identifier, of type
clientid4, instead of the longer and less compact nfs_client_id4
structure. This shorthand client identifier (a clientid) is
assigned by the server and should be chosen so that it will not
conflict with a clientid previously assigned by the server. This
applies across server restarts or reboots. When a clientid is
presented to a server and that clientid is not recognized, as would
happen after a server reboot, the server will reject the request with
the error NFS4ERR_STALE_CLIENTID. When this happens, the client must
obtain a new clientid by use of the SETCLIENTID operation and then
proceed to any other necessary recovery for the server reboot case
(See the section "Server Failure and Recovery").
The client must also employ the SETCLIENTID operation when it
receives an NFS4ERR_STALE_STATEID error using a stateid derived from
its current clientid, since this also indicates a server reboot which
has invalidated the existing clientid (see the next section
"lock_owner and stateid Definition" for details).
See the detailed descriptions of SETCLIENTID and SETCLIENTID_CONFIRM
for a complete specification of the operations.
6.1.2. Server Release of Clientid
If the server determines that the client holds no associated state
for its clientid, the server may choose to release the clientid. The
server may make this choice for an inactive client so that resources
are not consumed by those intermittently active clients. If the
client contacts the server after this release, the server must ensure
the client receives the appropriate error so that it will use the
SETCLIENTID/SETCLIENTID_CONFIRM sequence to establish a new identity.
It should be clear that the server must be very hesitant to release a
clientid since the resulting work on the client to recover from such
an event will be the same burden as if the server had failed and
restarted. Typically a server would not release a clientid unless
there had been no activity from that client for many minutes.
Note that if the id string in a SETCLIENTID request is properly
constructed, and if the client takes care to use the same principal
for each successive use of SETCLIENTID, then, barring an active
denial of service attack, NFS4ERR_CLID_INUSE should never be
returned.
However, client bugs, server bugs, or perhaps a deliberate change of
the principal owner of the id string (such as the case of a client
that changes security flavors, and under the new flavor, there is no
mapping to the previous owner) will in rare cases result in
NFS4ERR_CLID_INUSE.
In that event, when the server gets a SETCLIENTID for a client id
that currently has no state, or it has state, but the lease has
expired, rather than returning NFS4ERR_CLID_INUSE, the server MUST
allow the SETCLIENTID, and confirm the new clientid if followed by
the appropriate SETCLIENTID_CONFIRM.
6.1.3. lock_owner and stateid Definition
When requesting a lock, the client must present to the server the
clientid and an identifier for the owner of the requested lock.
Together, these two fields are referred to as the lock_owner; they
are defined as follows:
o A clientid returned by the server as part of the client's use of
the SETCLIENTID operation.
o A variable length opaque array used to uniquely define the owner
of a lock managed by the client.
This may be a thread id, process id, or other unique value.
When the server grants the lock, it responds with a unique stateid.
The stateid is used as a shorthand reference to the lock_owner, since
the server will be maintaining the correspondence between them.
The server is free to form the stateid in any manner that it chooses
as long as it is able to recognize invalid and out-of-date stateids.
This requirement includes those stateids generated by earlier
instances of the server. From this, the client can be properly
notified of a server restart. This notification will occur when the
client presents a stateid to the server from a previous
instantiation.
The server must be able to distinguish the following situations and
return the error as specified:
o The stateid was generated by an earlier server instance (i.e.
before a server reboot). The error NFS4ERR_STALE_STATEID should
be returned.
o The stateid was generated by the current server instance but the
stateid no longer designates the current locking state for the
lockowner-file pair in question (i.e. one or more locking
operations has occurred). The error NFS4ERR_OLD_STATEID should be
returned.
This error condition will only occur when the client issues a
locking request which changes a stateid while an I/O request that
uses that stateid is outstanding.
o The stateid was generated by the current server instance but the
stateid does not designate a locking state for any active
lockowner-file pair. The error NFS4ERR_BAD_STATEID should be
returned.
This error condition will occur when there has been a logic error
on the part of the client or server. This should not happen.
One mechanism that may be used to satisfy these requirements is for
the server to:
o divide the "other" field of each stateid into two fields:
* A server verifier which uniquely designates a particular server
instantiation.
* An index into a table of locking-state structures.
o utilize the "seqid" field of each stateid, such that seqid is
monotonically incremented for each stateid that is associated with
the same index into the locking-state table.
By matching the incoming stateid and its field values with the state
held at the server, the server is able to easily determine if a
stateid is valid for its current instantiation and state. If the
stateid is not valid, the appropriate error can be supplied to the
client.
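As an illustration only (not a protocol requirement), the mechanism
above might be sketched as follows, modeling the "other" field as a
(boot verifier, table index) pair and the "seqid" as a counter on the
table entry; the class layout, field sizes, and helper names are
assumptions made for the sketch:

```python
# Illustrative sketch of the suggested stateid scheme: the "other"
# field carries a server boot verifier plus a locking-state table
# index, and "seqid" counts locking operations on that table entry.
# Only the three error codes below are taken from the protocol.

NFS4ERR_STALE_STATEID = 10023   # earlier server instance
NFS4ERR_OLD_STATEID   = 10024   # superseded by a later locking op
NFS4ERR_BAD_STATEID   = 10025   # no active lockowner-file pair

class Server:
    def __init__(self, boot_verifier):
        self.boot_verifier = boot_verifier
        self.lock_table = {}        # index -> current seqid

    def make_stateid(self, index):
        seqid = self.lock_table.setdefault(index, 1)
        return (seqid, self.boot_verifier, index)

    def bump(self, index):
        # A locking operation occurred: advance the entry's seqid,
        # invalidating previously issued stateids for this entry.
        self.lock_table[index] += 1
        return self.make_stateid(index)

    def check_stateid(self, stateid):
        seqid, verifier, index = stateid
        if verifier != self.boot_verifier:
            return NFS4ERR_STALE_STATEID
        if index not in self.lock_table:
            return NFS4ERR_BAD_STATEID
        if seqid != self.lock_table[index]:
            return NFS4ERR_OLD_STATEID
        return 0                    # valid for current instantiation
```

A stateid presented after a reboot fails the verifier comparison and
draws NFS4ERR_STALE_STATEID, which is how the client learns of the
restart.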
6.1.4. Use of the stateid and Locking
All READ, WRITE and SETATTR operations contain a stateid. For the
purposes of this section, SETATTR operations which change the size
attribute of a file are treated as if they are writing the area
between the old and new size (i.e. the range truncated or added to
the file by means of the SETATTR), even where SETATTR is not
explicitly mentioned in the text.
If the lock_owner performs a READ or WRITE in a situation in which it
has established a lock or share reservation on the server (any OPEN
constitutes a share reservation) the stateid (previously returned by
the server) must be used to indicate what locks, including both
record locks and share reservations, are held by the lockowner. If
no state is established by the client, either record lock or share
reservation, a stateid of all bits 0 is used. Regardless of whether a
stateid of all bits 0 or a stateid returned by the server is used,
if there is a conflicting share reservation or mandatory record lock
held on the file, the server MUST refuse to service the READ or WRITE
operation.
Share reservations are established by OPEN operations and by their
nature are mandatory in that when the OPEN denies READ or WRITE
operations, that denial results in such operations being rejected
with error NFS4ERR_LOCKED. Record locks may be implemented by the
server as either mandatory or advisory, or the choice of mandatory or
advisory behavior may be determined by the server on the basis of the
file being accessed (for example, some UNIX-based servers support a
"mandatory lock bit" on the mode attribute such that if set, record
locks are required on the file before I/O is possible). When record
locks are advisory, they only prevent the granting of conflicting
lock requests and have no effect on READs or WRITEs. Mandatory
record locks, however, prevent conflicting I/O operations. When they
are attempted, they are rejected with NFS4ERR_LOCKED. When the
client gets NFS4ERR_LOCKED on a file it knows it has the proper share
reservation for, it will need to issue a LOCK request on the region
of the file that includes the region the I/O was to be performed on,
with an appropriate locktype (i.e. READ*_LT for a READ operation,
WRITE*_LT for a WRITE operation).
With NFS version 3, there was no notion of a stateid so there was no
way to tell if the application process of the client sending the READ
or WRITE operation had also acquired the appropriate record lock on
the file. Thus there was no way to implement mandatory locking.
With the stateid construct, this barrier has been removed.
Note that for UNIX environments that support mandatory file locking,
the distinction between advisory and mandatory locking is subtle. In
fact, advisory and mandatory record locks are exactly the same
insofar as the APIs and implementation requirements are concerned.
If the mandatory
lock attribute is set on the file, the server checks to see if the
lockowner has an appropriate shared (read) or exclusive (write)
record lock on the region it wishes to read or write to. If there is
no appropriate lock, the server checks if there is a conflicting lock
(which can be done by attempting to acquire the conflicting lock on
behalf of the lockowner and, if successful, releasing the lock
after the READ or WRITE is done), and if there is, the server returns
NFS4ERR_LOCKED.
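The server-side check described above might look like the following
sketch; the function signature and the lock representation are
hypothetical stand-ins for server internals, not part of the
protocol:

```python
# Sketch of the I/O check for a file whose mandatory lock attribute
# is set: allow the I/O if the lockowner holds an appropriate record
# lock on the range; otherwise refuse only if a conflicting lock held
# by another owner exists (equivalent to trying to acquire the needed
# lock on the lockowner's behalf).

NFS4ERR_LOCKED = 10012

def check_mandatory_io(lockowner, offset, length, is_write, locks):
    """locks: list of (owner, start, end, exclusive) record locks."""
    def overlaps(start, end):
        return start < offset + length and offset < end

    # Does the lockowner already hold an appropriate lock?
    for owner, start, end, exclusive in locks:
        if owner == lockowner and start <= offset and offset + length <= end:
            if exclusive or not is_write:
                return 0            # appropriate lock held; allow I/O

    # No appropriate lock: check for a conflicting lock.
    for owner, start, end, exclusive in locks:
        if owner != lockowner and overlaps(start, end) \
                and (exclusive or is_write):
            return NFS4ERR_LOCKED
    return 0
```

A READ against a region covered only by another owner's shared lock
succeeds, while a WRITE against the same region is rejected with
NFS4ERR_LOCKED.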
For Windows environments, there are no advisory record locks, so the
server always checks for record locks during I/O requests.
Thus, the NFS version 4 LOCK operation does not need to distinguish
between advisory and mandatory record locks. It is the NFS version 4
server's processing of the READ and WRITE operations that introduces
the distinction.
Every stateid other than the special stateid values noted in this
section, whether returned by an OPEN-type operation (i.e. OPEN,
OPEN_DOWNGRADE), or by a LOCK-type operation (i.e. LOCK or LOCKU),
defines an access mode for the file (i.e. READ, WRITE, or READ-
WRITE) as established by the original OPEN which began the stateid
sequence, and as modified by subsequent OPENs and OPEN_DOWNGRADEs
within that stateid sequence. When a READ, WRITE, or SETATTR which
specifies the size attribute is done, the operation is subject to
checking against the access mode to verify that the operation is
appropriate given the OPEN with which the operation is associated.
In the case of WRITE-type operations (i.e. WRITEs and SETATTRs which
set size), the server must verify that the access mode allows writing
and return an NFS4ERR_OPENMODE error if it does not. In the case of
READ, the server may perform the corresponding check on the access
mode, or it may choose to allow READ on opens for WRITE only, to
accommodate clients whose write implementation may unavoidably do
reads (e.g. due to buffer cache constraints). However, even if READs
are allowed in these circumstances, the server MUST still check for
locks that conflict with the READ (e.g. another open specifying denial
of READs). Note that a server which does enforce the access mode
check on READs need not explicitly check for conflicting share
reservations since the existence of OPEN for read access guarantees
that no conflicting share reservation can exist.
A stateid of all bits 1 (one) MAY allow READ operations to bypass
locking checks at the server. However, WRITE operations with a
stateid with bits all 1 (one) MUST NOT bypass locking checks and are
treated exactly the same as if a stateid of all bits 0 were used.
A lock may not be granted while a READ or WRITE operation using one
of the special stateids is being performed and the range of the lock
request conflicts with the range of the READ or WRITE operation. For
the purposes of this paragraph, a conflict occurs when a shared lock
is requested and a WRITE operation is being performed, or an
exclusive lock is requested and either a READ or a WRITE operation is
being performed. A SETATTR that sets size is treated similarly to a
WRITE as discussed above.
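The asymmetric treatment of the all-ones special stateid can be
summarized in a small sketch; the stateid encoding and parameter
names here are illustrative assumptions:

```python
# Sketch of special-stateid handling for READ and WRITE: the all-ones
# stateid MAY let a READ bypass locking checks at the server, but a
# WRITE with the all-ones stateid MUST NOT bypass them and is treated
# exactly like the anonymous all-zeros stateid.

ALL_ZEROS = b"\x00" * 16
ALL_ONES  = b"\xff" * 16

def io_lock_check_needed(stateid, is_write, read_bypass_enabled=True):
    if stateid == ALL_ONES and not is_write and read_bypass_enabled:
        return False    # server MAY allow READ to bypass lock checks
    return True         # WRITE, or any other stateid: full check
```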
6.1.5. Sequencing of Lock Requests
Locking is different from most NFS operations as it requires "at-
most-one" semantics that are not provided by ONCRPC. ONCRPC over a
reliable transport is not sufficient because a sequence of locking
requests may span multiple TCP connections. In the face of
retransmission or reordering, lock or unlock requests must have a
well defined and consistent behavior. To accomplish this, each lock
request contains a sequence number that is a consecutively increasing
integer. Different lock_owners have different sequences. The server
maintains the last sequence number (L) received and the response that
was returned. The first request issued for any given lock_owner is
issued with a sequence number of zero.
Note that for requests that contain a sequence number, for each
lock_owner, there should be no more than one outstanding request.
If a request (r) with a previous sequence number (r < L) is received,
it is rejected with the return of error NFS4ERR_BAD_SEQID. Given a
properly-functioning client, the response to (r) must have been
received before the last request (L) was sent. If a duplicate of the
last request (r == L) is received, the stored response is returned.
If a request beyond the next sequence (r == L + 2) is received, it is
rejected with the return of error NFS4ERR_BAD_SEQID. Sequence
history is reinitialized whenever the SETCLIENTID/SETCLIENTID_CONFIRM
sequence changes the client verifier.
Since the sequence number is represented with an unsigned 32-bit
integer, the arithmetic involved with the sequence number is mod
2^32. For an example of modulo arithmetic involving sequence numbers
see [RFC793].
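The per-lock_owner sequencing rules above can be sketched as follows;
the state layout and reply strings are illustrative assumptions, and
only the error code is taken from the protocol:

```python
# Sketch of lock-request sequencing: L is the last seqid received for
# the lock_owner, the last response is cached for replay, and all
# arithmetic is modulo 2^32.

NFS4ERR_BAD_SEQID = 10026
MOD = 2 ** 32

def handle_seqid(r, state):
    """state: dict with 'last_seqid' (L) and 'last_reply'."""
    L = state["last_seqid"]
    if r == L:                          # duplicate of the last request
        return ("replay", state["last_reply"])
    if r == (L + 1) % MOD:              # the expected next request
        state["last_seqid"] = r
        state["last_reply"] = reply = f"result-for-{r}"
        return ("new", reply)
    # r < L (old) or r beyond the next sequence: reject.
    return ("error", NFS4ERR_BAD_SEQID)
```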
It is critical the server maintain the last response sent to the
client to provide a more reliable cache of duplicate non-idempotent
requests than that of the traditional cache described in [Juszczak].
The traditional duplicate request cache uses a least recently used
algorithm for removing unneeded requests. However, the last lock
request and response on a given lock_owner must be cached as long as
the lock state exists on the server.
The client MUST monotonically increment the sequence number for the
CLOSE, LOCK, LOCKU, OPEN, OPEN_CONFIRM, and OPEN_DOWNGRADE
operations. This is true even in the event that the previous
operation that used the sequence number received an error. The only
exception to this rule is if the previous operation received one of
the following errors: NFS4ERR_STALE_CLIENTID, NFS4ERR_STALE_STATEID,
NFS4ERR_BAD_STATEID, NFS4ERR_BAD_SEQID, NFS4ERR_BADXDR,
NFS4ERR_RESOURCE, NFS4ERR_NOFILEHANDLE.
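The client-side increment rule, with its exception list, might be
sketched as follows (status values are shown as symbolic strings for
readability; the representation is an assumption of the sketch):

```python
# Sketch of the client's seqid advance rule: the seqid is incremented
# after every sequenced operation, even one that returned an error,
# EXCEPT for the listed errors, which indicate the request was never
# applied against the sequence.

SEQID_EXEMPT_ERRORS = {
    "NFS4ERR_STALE_CLIENTID", "NFS4ERR_STALE_STATEID",
    "NFS4ERR_BAD_STATEID", "NFS4ERR_BAD_SEQID", "NFS4ERR_BADXDR",
    "NFS4ERR_RESOURCE", "NFS4ERR_NOFILEHANDLE",
}

def next_seqid(current, last_status):
    if last_status in SEQID_EXEMPT_ERRORS:
        return current                   # reuse the same seqid
    return (current + 1) % 2 ** 32       # advance, even on other errors
```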
6.1.6. Recovery from Replayed Requests
As described above, the sequence number is per lock_owner. As long
as the server maintains the last sequence number received and follows
the methods described above, there are no risks of a Byzantine router
re-sending old requests. The server need only maintain the
(lock_owner, sequence number) state as long as there are open files
or closed files with locks outstanding.
LOCK, LOCKU, OPEN, OPEN_DOWNGRADE, and CLOSE each contain a sequence
number and therefore the risk of the replay of these operations
resulting in undesired effects is non-existent while the server
maintains the lock_owner state.
6.1.7. Releasing lock_owner State
When a particular lock_owner no longer holds open or file locking
state at the server, the server may choose to release the sequence
number state associated with the lock_owner. The server may make
this choice based on lease expiration, for the reclamation of server
memory, or other implementation specific details. In any event, the
server is able to do this safely only when the lock_owner no longer
is being utilized by the client. The server may choose to hold the
lock_owner state in the event that retransmitted requests are
received. However, the period to hold this state is implementation
specific.
In the case that a LOCK, LOCKU, OPEN_DOWNGRADE, or CLOSE is
retransmitted after the server has previously released the lock_owner
state, the server will find that the lock_owner has no files open and
an error will be returned to the client. If the lock_owner does have
a file open, the stateid will not match and again an error is
returned to the client.
6.1.8. Use of Open Confirmation
In the case that an OPEN is retransmitted and the lock_owner is being
used for the first time or the lock_owner state has been previously
released by the server, the use of the OPEN_CONFIRM operation will
prevent incorrect behavior. When the server observes the use of the
lock_owner for the first time, it will direct the client to perform
the OPEN_CONFIRM for the corresponding OPEN. This sequence
establishes the use of a lock_owner and associated sequence number.
Since the OPEN_CONFIRM sequence connects a new open_owner on the
server with an existing open_owner on a client, the sequence number
may have any value. The OPEN_CONFIRM step assures the server that
the value received is the correct one. See the section "OPEN_CONFIRM
- Confirm Open" for further details.
There are a number of situations in which the requirement to confirm
an OPEN would pose difficulties for the client and server, in that
they would be prevented from acting in a timely fashion on
information received, because that information would be provisional,
subject to deletion upon non-confirmation. Fortunately, these are
situations in which the server can avoid the need for confirmation
when responding to open requests. The two constraints are:
o The server must not bestow a delegation for any open which would
require confirmation.
o The server MUST NOT require confirmation on a reclaim-type open
(i.e. one specifying claim type CLAIM_PREVIOUS or
CLAIM_DELEGATE_PREV).
These constraints are related in that reclaim-type opens are the only
ones in which the server may be required to send a delegation. For
CLAIM_NULL, sending the delegation is optional while for
CLAIM_DELEGATE_CUR, no delegation is sent.
Delegations being sent with an open requiring confirmation are
troublesome because recovering from non-confirmation adds undue
complexity to the protocol, while requiring confirmation on reclaim-
type opens poses difficulties in that the inability to resolve the
status of the reclaim until lease expiration may make it difficult to
have timely determination of the set of locks being reclaimed (since
the grace period may expire).
Requiring open confirmation on reclaim-type opens is avoidable
because of the nature of the environments in which such opens are
done. For CLAIM_PREVIOUS opens, this is immediately after server
reboot, so there should be no time for lockowners to be created,
found to be unused, and recycled. For CLAIM_DELEGATE_PREV opens, we
are dealing with a client reboot situation. A server which supports
delegation can be sure that no lockowners for that client have been
recycled since client initialization and thus can ensure that
confirmation will not be required.
6.2. Lock Ranges
The protocol allows a lock owner to request a lock with a byte range
and then either upgrade or unlock a sub-range of the initial lock.
It is expected that this will be an uncommon type of request. In any
case, servers or server filesystems may not be able to support sub-
range lock semantics. In the event that a server receives a locking
request that represents a sub-range of current locking state for the
lock owner, the server is allowed to return the error
NFS4ERR_LOCK_RANGE to signify that it does not support sub-range lock
operations. Therefore, the client should be prepared to receive this
error and, if appropriate, report the error to the requesting
application.
The client is discouraged from combining multiple independent locking
ranges that happen to be adjacent into a single request since the
server may not support sub-range requests and for reasons related to
the recovery of file locking state in the event of server failure.
As discussed in the section "Server Failure and Recovery" below, the
server may employ certain optimizations during recovery that work
effectively only when the client's behavior during lock recovery is
similar to the client's locking behavior prior to server failure.
6.3. Upgrading and Downgrading Locks
If a client has a write lock on a record, it can request an atomic
downgrade of the lock to a read lock via the LOCK request, by setting
the type to READ_LT. If the server supports atomic downgrade, the
request will succeed. If not, it will return NFS4ERR_LOCK_NOTSUPP.
The client should be prepared to receive this error, and if
appropriate, report the error to the requesting application.
If a client has a read lock on a record, it can request an atomic
upgrade of the lock to a write lock via the LOCK request by setting
the type to WRITE_LT or WRITEW_LT. If the server does not support
atomic upgrade, it will return NFS4ERR_LOCK_NOTSUPP. If the upgrade
can be achieved without an existing conflict, the request will
succeed. Otherwise, the server will return either NFS4ERR_DENIED or
NFS4ERR_DEADLOCK. The error NFS4ERR_DEADLOCK is returned if the
client issued the LOCK request with the type set to WRITEW_LT and the
server has detected a deadlock. The client should be prepared to
receive such errors and if appropriate, report the error to the
requesting application.
6.4. Blocking Locks
Some clients require the support of blocking locks. The NFS version
4 protocol must not rely on a callback mechanism and therefore is
unable to notify a client when a previously denied lock has been
granted. Clients have no choice but to continually poll for the
lock. This presents a fairness problem. Two new lock types are
added, READW and WRITEW, and are used to indicate to the server that
the client is requesting a blocking lock. The server should maintain
an ordered list of pending blocking locks. When the conflicting lock
is released, the server may wait the lease period for the first
waiting client to re-request the lock. After the lease period
expires the next waiting client request is allowed the lock. Clients
are required to poll at an interval sufficiently small that it is
likely to acquire the lock in a timely manner. The server is not
required to maintain a list of pending blocking locks as the list is
used to increase fairness and not correct operation. Because of the
unordered nature of crash recovery, storing of lock state to stable
storage would be required to guarantee ordered granting of blocking
locks.
Servers may also note the lock types and delay returning denial of
the request to allow extra time for a conflicting lock to be
released, allowing a successful return. In this way, clients can
avoid the burden of needlessly frequent polling for blocking locks.
The server should take care in the length of delay in the event the
client retransmits the request.
6.5. Lease Renewal
The purpose of a lease is to allow a server to remove stale locks
that are held by a client that has crashed or is otherwise
unreachable. It is not a mechanism for cache consistency and lease
renewals may not be denied if the lease interval has not expired.
The following events cause implicit renewal of all of the leases for
a given client (i.e. all those sharing a given clientid). Each of
these is a positive indication that the client is still active and
that the associated state held at the server, for the client, is
still valid.
o An OPEN with a valid clientid.
o Any operation made with a valid stateid (CLOSE, DELEGRETURN, LOCK,
LOCKU, OPEN, OPEN_CONFIRM, OPEN_DOWNGRADE, READ, SETATTR, WRITE).
This does not include the special stateids of all bits 0 or all
bits 1.
Note that if the client had restarted or rebooted, the client
would not be making these requests without issuing the
SETCLIENTID/SETCLIENTID_CONFIRM sequence. The use of the
SETCLIENTID/SETCLIENTID_CONFIRM sequence (one that changes the
client verifier) notifies the server to drop the locking state
associated with the client. SETCLIENTID/SETCLIENTID_CONFIRM never
renews a lease.
If the server has rebooted, the stateids (NFS4ERR_STALE_STATEID
error) or the clientid (NFS4ERR_STALE_CLIENTID error) will not be
valid, thus preventing spurious renewals.
This approach allows for low overhead lease renewal which scales
well. In the typical case no extra RPC calls are required for lease
renewal and in the worst case one RPC is required every lease period
(i.e. a RENEW operation). The number of locks held by the client is
not a factor since all state for the client is involved with the
lease renewal action.
Since all operations that create a new lease also renew existing
leases, the server must maintain a common lease expiration time for
all valid leases for a given client. This lease time can then be
easily updated upon implicit lease renewal actions.
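A common per-client lease timer, as described above, might be
sketched as follows (the class and method names are assumptions of
the sketch):

```python
# Sketch of a single lease expiration time shared by all of a
# client's state: any renewing event resets one timer, so the cost of
# renewal does not grow with the number of locks held.

import time

class ClientLease:
    def __init__(self, lease_period):
        self.lease_period = lease_period
        self.expiry = time.monotonic() + lease_period

    def renew(self):
        # Called on OPEN with a valid clientid, on any operation that
        # carries a valid (non-special) stateid, or on explicit RENEW.
        self.expiry = time.monotonic() + self.lease_period

    def expired(self):
        return time.monotonic() >= self.expiry
```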
6.6. Crash Recovery
The important requirement in crash recovery is that both the client
and the server know when the other has failed. Additionally, it is
required that a client see a consistent view of data across server
restarts or reboots. All READ and WRITE operations that may have
been queued within the client or network buffers must wait until the
client has successfully recovered the locks protecting the READ and
WRITE operations.
6.6.1. Client Failure and Recovery
In the event that a client fails, the server may recover the client's
locks when the associated leases have expired. Conflicting locks
from another client may only be granted after this lease expiration.
If the client is able to restart or reinitialize within the lease
period the client may be forced to wait the remainder of the lease
period before obtaining new locks.
To minimize client delay upon restart, lock requests are associated
with an instance of the client by a client supplied verifier. This
verifier is part of the initial SETCLIENTID call made by the client.
The server returns a clientid as a result of the SETCLIENTID
operation. The client then confirms the use of the clientid with
SETCLIENTID_CONFIRM. The clientid in combination with an opaque
owner field is then used by the client to identify the lock owner for
OPEN. This chain of associations is then used to identify all locks
for a particular client.
Since the verifier will be changed by the client upon each
initialization, the server can compare a new verifier to the verifier
associated with currently held locks and determine that they do not
match. This signifies the client's new instantiation and subsequent
loss of locking state. As a result, the server is free to release
all locks held which are associated with the old clientid which was
derived from the old verifier.
Note that the verifier must have the same uniqueness properties of
the verifier for the COMMIT operation.
6.6.2. Server Failure and Recovery
If the server loses locking state (usually as a result of a restart
or reboot), it must allow clients time to discover this fact and re-
establish the lost locking state. The client must be able to re-
establish the locking state without having the server deny valid
requests because the server has granted conflicting access to another
client. Likewise, if there is the possibility that clients have not
yet re-established their locking state for a file, the server must
disallow READ and WRITE operations for that file. The duration of
this recovery period is equal to the duration of the lease period.
A client can determine that server failure (and thus loss of locking
state) has occurred, when it receives one of two errors. The
NFS4ERR_STALE_STATEID error indicates a stateid invalidated by a
reboot or restart. The NFS4ERR_STALE_CLIENTID error indicates a
clientid invalidated by reboot or restart. When either of these are
received, the client must establish a new clientid (See the section
"Client ID") and re-establish the locking state as discussed below.
The period of special handling of locking and READs and WRITEs, equal
in duration to the lease period, is referred to as the "grace
period". During the grace period, clients recover locks and the
associated state by reclaim-type locking requests (i.e. LOCK
requests with reclaim set to true and OPEN operations with a claim
type of CLAIM_PREVIOUS). During the grace period, the server must
reject READ and WRITE operations and non-reclaim locking requests
(i.e. other LOCK and OPEN operations) with an error of NFS4ERR_GRACE.
If the server can reliably determine that granting a non-reclaim
request will not conflict with reclamation of locks by other clients,
the NFS4ERR_GRACE error does not have to be returned and the non-
reclaim client request can be serviced. For the server to be able to
service READ and WRITE operations during the grace period, it must
again be able to guarantee that no possible conflict could arise
between an impending reclaim locking request and the READ or WRITE
operation. If the server is unable to offer that guarantee, the
NFS4ERR_GRACE error must be returned to the client.
For a server to provide simple, valid handling during the grace
period, the easiest method is to simply reject all non-reclaim
locking requests and READ and WRITE operations by returning the
NFS4ERR_GRACE error. However, a server may keep information about
granted locks in stable storage. With this information, the server
could determine if a regular lock or READ or WRITE operation can be
safely processed.
For example, if a count of locks on a given file is available in
stable storage, the server can track reclaimed locks for the file and
when all reclaims have been processed, non-reclaim locking requests
may be processed. This way the server can ensure that non-reclaim
locking requests will not conflict with potential reclaim requests.
With respect to I/O requests, if the server is able to determine that
there are no outstanding reclaim requests for a file by information
from stable storage or another similar mechanism, the processing of
I/O requests could proceed normally for the file.
To reiterate, for a server that allows non-reclaim lock and I/O
requests to be processed during the grace period, it MUST determine
that no lock subsequently reclaimed will be rejected and that no lock
subsequently reclaimed would have prevented any I/O operation
processed during the grace period.
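The lock-count optimization above can be sketched as follows; the
data layout and class names are illustrative assumptions:

```python
# Sketch of grace-period handling with a per-file lock count saved in
# stable storage: once every lock held on a file before the reboot
# has been reclaimed, non-reclaim locking and I/O requests for that
# file can be serviced without risk of conflicting with a reclaim.

NFS4ERR_GRACE = 10013

class GraceState:
    def __init__(self, stable_counts):
        # stable_counts: file -> number of locks held before reboot
        self.pending = dict(stable_counts)

    def note_reclaim(self, file):
        self.pending[file] -= 1
        if self.pending[file] == 0:
            del self.pending[file]

    def check_nonreclaim(self, file):
        # Non-reclaim request allowed only when no reclaims remain.
        return 0 if file not in self.pending else NFS4ERR_GRACE
```

A file with no pre-reboot locks recorded can be serviced normally
from the start of the grace period.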
Clients should be prepared for the return of NFS4ERR_GRACE errors for
non-reclaim lock and I/O requests. In this case the client should
employ a retry mechanism for the request. A delay (on the order of
several seconds) between retries should be used to avoid overwhelming
the server. Further discussion of the general issue is included in
[Floyd]. The client must account for the server that is able to
perform I/O and non-reclaim locking requests within the grace period
as well as those that can not do so.
A reclaim-type locking request outside the server's grace period can
only succeed if the server can guarantee that no conflicting lock or
I/O request has been granted since reboot or restart.
A server may, upon restart, establish a new value for the lease
period. Therefore, clients should, once a new clientid is
established, refetch the lease_time attribute and use it as the basis
for lease renewal for the lease associated with that server.
However, the server must establish, for this restart event, a grace
period at least as long as the lease period for the previous server
instantiation. This allows the client state obtained during the
previous server instance to be reliably re-established.
6.6.3. Network Partitions and Recovery
If the duration of a network partition is greater than the lease
period provided by the server, the server will not have received a
lease renewal from the client. If this occurs, the server may free
all locks held for the client. As a result, all stateids held by the
client will become invalid or stale. Once the client is able to
reach the server after such a network partition, all I/O submitted by
the client with the now invalid stateids will fail with the server
returning the error NFS4ERR_EXPIRED. Once this error is received,
the client will suitably notify the application that held the lock.
As a courtesy to the client or as an optimization, the server may
continue to hold locks on behalf of a client for which recent
communication has extended beyond the lease period. If the server
receives a lock or I/O request that conflicts with one of these
courtesy locks, the server must free the courtesy lock and grant the
new request.
When a network partition is combined with a server reboot, there are
edge conditions that place requirements on the server in order to
avoid silent data corruption following the server reboot. Two of
these edge conditions are known, and are discussed below.
The first edge condition has the following scenario:
1. Client A acquires a lock.
2. Client A and server experience mutual network partition, such
that client A is unable to renew its lease.
3. Client A's lease expires, so server releases lock.
4. Client B acquires a lock that would have conflicted with that of
Client A.
5. Client B releases the lock
6. Server reboots
7. Network partition between client A and server heals.
8. Client A issues a RENEW operation, and gets back a
NFS4ERR_STALE_CLIENTID.
9. Client A reclaims its lock within the server's grace period.
Thus, at the final step, the server has erroneously granted client
A's lock reclaim. If client B modified the object the lock was
protecting, client A will experience object corruption.
The second known edge condition follows:
1. Client A acquires a lock.
2. Server reboots.
3. Client A and server experience mutual network partition, such
that client A is unable to reclaim its lock within the grace
period.
4. Server's reclaim grace period ends. Client A has no locks
recorded on server.
5. Client B acquires a lock that would have conflicted with that of
Client A.
6. Client B releases the lock
7. Server reboots a second time
8. Network partition between client A and server heals.
9. Client A issues a RENEW operation, and gets back a
NFS4ERR_STALE_CLIENTID.
10. Client A reclaims its lock within the server's grace period.
As with the first edge condition, the final step of the scenario of
the second edge condition has the server erroneously granting client
A's lock reclaim.
Solving the first and second edge conditions requires that the server
either assume, after it reboots, that an edge condition has occurred,
and thus return NFS4ERR_NO_GRACE for all reclaim attempts, or that
the server record some information in stable storage. The amount of
information
the server records in stable storage is in inverse proportion to how
harsh the server wants to be whenever the edge conditions occur. The
server that is completely tolerant of all edge conditions will record
in stable storage every lock that is acquired, removing the lock
record from stable storage only when the lock is unlocked by the
client and the lock's lockowner advances the sequence number such
that the lock release is not the last stateful event for the
lockowner's sequence. For the two aforementioned edge conditions,
the harshest a server can be, and still support a grace period for
reclaims, requires that the server record some minimal information
in stable storage. For example, a server
implementation could, for each client, save in stable storage a
record containing:
o the client's id string
o a boolean that indicates if the client's lease expired or if there
was administrative intervention (see the section, Server
Revocation of Locks) to revoke a record lock, share reservation,
or delegation
o a timestamp that is updated the first time, after a server boot or
reboot, that the client acquires record locking, share reservation,
or delegation state on the server. The timestamp need not be updated
on subsequent lock requests until the server reboots.
The server implementation would also record in the stable storage the
timestamps from the two most recent server reboots.
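The per-client record above could be laid out as in the following
sketch; the structure names, the fixed-size id buffer, and the use of
time_t are illustrative choices, not protocol requirements:

```c
#include <stdbool.h>
#include <time.h>

/* Illustrative sketch of the per-client stable storage record
 * described above; names and layout are not mandated by the
 * protocol. */
struct client_reclaim_record {
    char   client_id_string[1024];   /* the client's id string */
    bool   lease_expired_or_revoked; /* lease expired, or a lock was
                                        administratively revoked */
    time_t first_state_acquired;     /* first lock/share/delegation
                                        acquired after server boot */
};

/* The server also records the timestamps of the two most recent
 * server reboots. */
struct server_boot_record {
    time_t previous_boot;
    time_t current_boot;
};
```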
Assuming the above record keeping, for the first edge condition,
after the server reboots, the record that client A's lease expired
means that another client could have acquired a conflicting record
lock, share reservation, or delegation. Hence the server must reject
a reclaim from client A with the error NFS4ERR_NO_GRACE.
For the second edge condition, after the server reboots for a second
time, the record that the client had an unexpired record lock, share
reservation, or delegation established before the server's previous
incarnation means that the server must reject a reclaim from client A
with the error NFS4ERR_NO_GRACE.
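Both edge-condition checks reduce to a small decision, sketched below
under the assumption that the server kept the per-client record and
reboot timestamps described above (field and function names are
illustrative):

```c
#include <stdbool.h>
#include <time.h>

/* Sketch of the reclaim admissibility check described above.
 * Returns true if the reclaim must be rejected with
 * NFS4ERR_NO_GRACE. */
bool must_reject_reclaim(bool lease_expired_or_revoked,
                         time_t first_state_acquired,
                         time_t previous_boot)
{
    /* First edge condition: the client's lease expired (or a lock
     * was administratively revoked), so another client may have
     * acquired a conflicting lock, share reservation, or
     * delegation. */
    if (lease_expired_or_revoked)
        return true;

    /* Second edge condition: the client's state was established
     * before the server's previous incarnation, so an intervening
     * grace period may have allowed a conflicting lock. */
    if (first_state_acquired < previous_boot)
        return true;

    return false;
}
```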
Regardless of the level and approach to record keeping, the server
MUST implement one of the following strategies (which apply to
reclaims of share reservations, record locks, and delegations):
1. Reject all reclaims with NFS4ERR_NO_GRACE. This is super harsh,
but necessary if the server does not want to record lock state in
stable storage.
2. Record sufficient state in stable storage such that all known
edge conditions involving server reboot, including the two noted
in this section, are detected. False positives are acceptable.
Note that at this time, it is not known if there are other edge
conditions.
If, after a server reboot, the server determines that there is
unrecoverable damage or corruption to the stable storage, then for
all clients and/or locks affected, the server MUST return
NFS4ERR_NO_GRACE.
A mandate for the client's handling of the NFS4ERR_NO_GRACE error is
outside the scope of this specification, since the strategies for
such handling are very dependent on the client's operating
environment. However, one potential approach is described below.
When the client receives NFS4ERR_NO_GRACE, it could examine the
change attribute of the objects the client is trying to reclaim state
for, and use that to determine whether to re-establish the state via
normal OPEN or LOCK requests. This is acceptable provided the
client's operating environment allows it. In other words, the client
implementor is advised to document this behavior for users. The
client could also inform the application that its record lock or
share reservations (whether they were delegated or not) have been
lost, such as via a UNIX signal, a GUI pop-up window, etc. See the
section, "Data Caching and Revocation" for a discussion of what the
client should do for dealing with unreclaimed delegations on client
state.
For further discussion of revocation of locks see the section "Server
Revocation of Locks".
6.7. Recovery from a Lock Request Timeout or Abort
In the event a lock request times out, a client may decide to not
retry the request. The client may also abort the request when the
process for which it was issued is terminated (e.g. in UNIX due to a
signal). It is possible though that the server received the request
and acted upon it. This would change the state on the server without
the client being aware of the change. It is paramount that the
client re-synchronize state with server before it attempts any other
operation that takes a seqid and/or a stateid with the same
lock_owner. This is straightforward to do without a special re-
synchronize operation.
Since the server maintains the last lock request and response
received for each lock_owner, the client should cache, for each
lock_owner, the last lock request it sent that did not receive a
response. The next time the client performs a lock operation for that
lock_owner, it can first resend the cached request, if there is one.
If the request was one that established state (e.g. a LOCK or OPEN
operation), the server will return the cached result or, if it never
saw the request, perform it. The client can then follow up with a
request to remove the state (e.g. a LOCKU or CLOSE operation).
With this approach, the sequencing and stateid information on the
client and server for the given lock_owner will re-synchronize and in
turn the lock state will re-synchronize.
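The caching approach above might be kept client-side as in this
sketch; lock_request and the helper functions are hypothetical
stand-ins for a real RPC layer:

```c
#include <stdbool.h>

/* Illustrative sketch of the per-lock_owner replay cache described
 * above.  A real client would store the full encoded request. */
struct lock_request { int op; unsigned seqid; };

struct lock_owner {
    struct lock_request pending; /* last request with no reply seen */
    bool have_pending;
};

/* Called when a lock request times out or is aborted: remember it
 * so it can be replayed before the next operation. */
void note_unanswered(struct lock_owner *lo, struct lock_request req)
{
    lo->pending = req;
    lo->have_pending = true;
}

/* Before the next lock operation for this lock_owner, report
 * whether a cached request must be replayed first; the server will
 * answer from its cached response or perform the request if it
 * never saw it. */
bool needs_replay(const struct lock_owner *lo, struct lock_request *out)
{
    if (!lo->have_pending)
        return false;
    *out = lo->pending;
    return true;
}
```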
6.8. Server Revocation of Locks
At any point, the server can revoke locks held by a client and the
client must be prepared for this event. When the client detects that
its locks have been or may have been revoked, the client is
responsible for validating the state information between itself and
the server. Validating locking state for the client means that it
must verify or reclaim state for each lock currently held.
The first instance of lock revocation is upon server reboot or re-
initialization. In this instance the client will receive an error
(NFS4ERR_STALE_STATEID or NFS4ERR_STALE_CLIENTID) and the client will
proceed with normal crash recovery as described in the previous
section.
The second lock revocation event is the inability to renew the lease
before expiration. While this is considered a rare or unusual event,
the client must be prepared to recover. Both the server and client
will be able to detect the failure to renew the lease and are capable
of recovering without data corruption. For the server, it tracks the
last renewal event serviced for the client and knows when the lease
will expire. Similarly, the client must track operations which will
renew the lease period. Using the time that each such request was
sent and the time that the corresponding reply was received, the
client should bound the time that the corresponding renewal could
have occurred on the server and thus determine if it is possible that
a lease period expiration could have occurred.
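The bound described above can be computed conservatively as follows;
this is a sketch, assuming the client treats the send time of the
last renewing request as the earliest moment the renewal could have
taken effect on the server:

```c
#include <stdbool.h>
#include <time.h>

/* Conservative client-side lease-expiry check, as described above.
 * A renewal takes effect on the server no earlier than the time the
 * renewing request was sent, so the lease may expire as early as
 * last_renewal_sent + lease_period.  A real client would use a
 * monotonic clock. */
bool lease_may_have_expired(time_t last_renewal_sent,
                            time_t lease_period,
                            time_t now)
{
    return now >= last_renewal_sent + lease_period;
}
```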
The third lock revocation event can occur as a result of
administrative intervention within the lease period. While this is
considered a rare event, it is possible that the server's
administrator has decided to release or revoke a particular lock held
by the client. As a result of revocation, the client will receive an
error of NFS4ERR_ADMIN_REVOKED. In this instance the client may
assume that only the lock_owner's locks have been lost. The client
notifies the lock holder appropriately. The client may not assume
the lease period has been renewed as a result of the failed
operation.
When the client determines the lease period may have expired, the
client must mark all locks held for the associated lease as
"unvalidated". This means the client has been unable to re-establish
or confirm the appropriate lock state with the server. As described
in the previous section on crash recovery, there are scenarios in
which the server may grant conflicting locks after the lease period
has expired for a client. When it is possible that the lease period
has expired, the client must validate each lock currently held to
ensure that a conflicting lock has not been granted. The client may
accomplish this task by issuing an I/O request, either a pending I/O
or a zero-length read, specifying the stateid associated with the
lock in question. If the response to the request is success, the
client has validated all of the locks governed by that stateid and
re-established the appropriate state between itself and the server.
If the I/O request is not successful, then one or more of the locks
associated with the stateid was revoked by the server and the client
must notify the owner.
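The zero-length-read validation might look like the sketch below;
zero_length_read is a hypothetical stand-in for issuing a READ of
length zero with the lock's stateid (its dummy body here simply
simulates one revoked stateid):

```c
#include <stdbool.h>

/* Result codes; values are from the base protocol. */
enum nfs_status { NFS4_OK = 0, NFS4ERR_EXPIRED = 10011 };

/* Hypothetical stand-in for a zero-length READ issued with the
 * given stateid.  For illustration, pretend stateid 42 was revoked
 * by the server. */
static enum nfs_status zero_length_read(unsigned stateid)
{
    return stateid == 42 ? NFS4ERR_EXPIRED : NFS4_OK;
}

/* Validate one "unvalidated" stateid as described above.  Returns
 * true if every lock governed by the stateid is still valid; on
 * failure the caller must notify the lock's owner. */
bool validate_stateid(unsigned stateid)
{
    return zero_length_read(stateid) == NFS4_OK;
}
```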
6.9. Share Reservations
A share reservation is a mechanism to control access to a file. It
is a separate and independent mechanism from record locking. When a
client opens a file, it issues an OPEN operation to the server
specifying the type of access required (READ, WRITE, or BOTH) and the
type of access to deny others (deny NONE, READ, WRITE, or BOTH). If
the OPEN fails the client will fail the application's open request.
Pseudo-code definition of the semantics:
if (request.access == 0)
return (NFS4ERR_INVAL)
else
if ((request.access & file_state.deny) ||
(request.deny & file_state.access))
return (NFS4ERR_DENIED)
This checking of share reservations on OPEN is done with no exception
for an existing OPEN for the same open_owner.
The constants used for the OPEN and OPEN_DOWNGRADE operations for the
access and deny fields are as follows:
const OPEN4_SHARE_ACCESS_READ = 0x00000001;
const OPEN4_SHARE_ACCESS_WRITE = 0x00000002;
const OPEN4_SHARE_ACCESS_BOTH = 0x00000003;
const OPEN4_SHARE_DENY_NONE = 0x00000000;
const OPEN4_SHARE_DENY_READ = 0x00000001;
const OPEN4_SHARE_DENY_WRITE = 0x00000002;
const OPEN4_SHARE_DENY_BOTH = 0x00000003;
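A sketch of this check as server code, using the constants above (the
NFS4_OK, NFS4ERR_INVAL, and NFS4ERR_DENIED values are from the base
protocol; the function itself is illustrative):

```c
#define OPEN4_SHARE_ACCESS_READ  0x00000001
#define OPEN4_SHARE_ACCESS_WRITE 0x00000002
#define OPEN4_SHARE_ACCESS_BOTH  0x00000003
#define OPEN4_SHARE_DENY_NONE    0x00000000
#define OPEN4_SHARE_DENY_READ    0x00000001
#define OPEN4_SHARE_DENY_WRITE   0x00000002
#define OPEN4_SHARE_DENY_BOTH    0x00000003

#define NFS4_OK        0
#define NFS4ERR_INVAL  22
#define NFS4ERR_DENIED 10010

/* Share-reservation check performed by the server on OPEN, per the
 * pseudo-code above.  No exception is made for an existing OPEN by
 * the same open_owner. */
int check_open_share(unsigned req_access, unsigned req_deny,
                     unsigned cur_access, unsigned cur_deny)
{
    if (req_access == 0)
        return NFS4ERR_INVAL;
    if ((req_access & cur_deny) || (req_deny & cur_access))
        return NFS4ERR_DENIED;
    return NFS4_OK;
}
```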
6.10. OPEN/CLOSE Operations
To provide correct share semantics, a client MUST use the OPEN
operation to obtain the initial filehandle and indicate the desired
access and what if any access to deny. Even if the client intends to
use a stateid of all 0's or all 1's, it must still obtain the
filehandle for the regular file with the OPEN operation so the
appropriate share semantics can be applied. For clients that do not
have a deny mode built into their open programming interfaces, deny
equal to NONE should be used.
The OPEN operation with the CREATE flag also subsumes the CREATE
operation for regular files as used in previous versions of the NFS
protocol. This allows a create with a share to be done atomically.
The CLOSE operation removes all share reservations held by th