NFSv4                                                         T. Haynes
Internet-Draft                                                   Editor
Intended status: Standards Track                          March 05, 2010
Expires: September 6, 2010

                        NFS Version 4 Protocol
                   draft-ietf-nfsv4-rfc3530bis-01.txt
Abstract
The Network File System (NFS) version 4 is a distributed filesystem
protocol which owes heritage to NFS protocol version 2, RFC 1094, and
version 3, RFC 1813.  Unlike earlier versions, the NFS version 4
protocol supports traditional file access while integrating support
for file locking and the mount protocol.  In addition, support for
strong security (and its negotiation), compound operations, client
caching, and internationalization have been added.  Of course,
attention has been applied to making NFS version 4 operate well in an
Internet environment.
This document may contain material from IETF Documents or IETF
Contributions published or made publicly available before November 10,
2008.  The person(s) controlling the copyright in some of this material
may not have granted the IETF Trust the right to allow modifications of
such material outside the IETF Standards Process.  Without obtaining an
adequate license from the person(s) controlling the copyright in such
materials, this document may not be modified outside the IETF Standards
Process, and derivative works of it may not be created outside the IETF
Standards Process, except to format it for publication as an RFC or to
translate it into languages other than English.
Table of Contents
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 7
1.1. Changes since RFC 3530 . . . . . . . . . . . . . . . . . 7
1.2. Changes since RFC 3010 . . . . . . . . . . . . . . . . . 7
1.3. NFS Version 4 Goals . . . . . . . . . . . . . . . . . . 8
1.4. Inconsistencies of this Document with Section 18 . . . . 9
1.5. Overview of NFS version 4 Features . . . . . . . . . . . 9
1.5.1. RPC and Security . . . . . . . . . . . . . . . . . . 9
1.5.2. Procedure and Operation Structure . . . . . . . . . . 10
1.5.3. Filesystem Model . . . . . . . . . . . . . . . . . . 10
1.5.4. OPEN and CLOSE . . . . . . . . . . . . . . . . . . . 12
1.5.5. File locking . . . . . . . . . . . . . . . . . . . . 12
1.5.6. Client Caching and Delegation . . . . . . . . . . . . 13
1.6. General Definitions . . . . . . . . . . . . . . . . . . 13
2. Protocol Data Types . . . . . . . . . . . . . . . . . . . . . 15
2.1. Basic Data Types . . . . . . . . . . . . . . . . . . . . 15
2.2. Structured Data Types . . . . . . . . . . . . . . . . . 17
3. RPC and Security Flavor . . . . . . . . . . . . . . . . . . . 22
3.1. Ports and Transports . . . . . . . . . . . . . . . . . . 22
3.1.1. Client Retransmission Behavior . . . . . . . . . . . 23
3.2. Security Flavors . . . . . . . . . . . . . . . . . . . . 23
3.2.1. Security mechanisms for NFS version 4 . . . . . . . . 24
3.3. Security Negotiation . . . . . . . . . . . . . . . . . . 26
3.3.1. SECINFO . . . . . . . . . . . . . . . . . . . . . . . 26
3.3.2. Security Error . . . . . . . . . . . . . . . . . . . 27
3.3.3. Callback RPC Authentication . . . . . . . . . . . . . 27
4. Filehandles . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.1. Obtaining the First Filehandle . . . . . . . . . . . . . 29
4.1.1. Root Filehandle . . . . . . . . . . . . . . . . . . . 29
4.1.2. Public Filehandle . . . . . . . . . . . . . . . . . . 30
4.2. Filehandle Types . . . . . . . . . . . . . . . . . . . . 30
4.2.1. General Properties of a Filehandle . . . . . . . . . 30
4.2.2. Persistent Filehandle . . . . . . . . . . . . . . . . 31
4.2.3. Volatile Filehandle . . . . . . . . . . . . . . . . . 31
4.2.4. One Method of Constructing a Volatile Filehandle . . 33
4.3. Client Recovery from Filehandle Expiration . . . . . . . 33
5. File Attributes . . . . . . . . . . . . . . . . . . . . . . . 34
5.1. Mandatory Attributes . . . . . . . . . . . . . . . . . . 35
5.2. Recommended Attributes . . . . . . . . . . . . . . . . . 35
5.3. Named Attributes . . . . . . . . . . . . . . . . . . . . 36
5.4. Classification of Attributes . . . . . . . . . . . . . . 36
5.5. Mandatory Attributes - Definitions . . . . . . . . . . . 37
5.6. Recommended Attributes - Definitions . . . . . . . . . . 39
5.7. Time Access . . . . . . . . . . . . . . . . . . . . . . 45
5.8. Interpreting owner and owner_group . . . . . . . . . . . 46
5.9. Character Case Attributes . . . . . . . . . . . . . . . 48
5.10. Quota Attributes . . . . . . . . . . . . . . . . . . . . 48
5.11. Access Control Lists . . . . . . . . . . . . . . . . . . 49
5.11.1. ACE type . . . . . . . . . . . . . . . . . . . . . . 50
5.11.2. ACE Access Mask . . . . . . . . . . . . . . . . . . . 51
5.11.3. ACE flag . . . . . . . . . . . . . . . . . . . . . . 53
5.11.4. ACE who . . . . . . . . . . . . . . . . . . . . . . . 55
5.11.5. Mode Attribute . . . . . . . . . . . . . . . . . . . 56
5.11.6. Mode and ACL Attribute . . . . . . . . . . . . . . . 57
5.11.7. mounted_on_fileid . . . . . . . . . . . . . . . . . . 57
6. Filesystem Migration and Replication . . . . . . . . . . . . 58
6.1. Replication . . . . . . . . . . . . . . . . . . . . . . 59
6.2. Migration . . . . . . . . . . . . . . . . . . . . . . . 59
6.3. Interpretation of the fs_locations Attribute . . . . . . 60
6.4. Filehandle Recovery for Migration or Replication . . . . 61
7. NFS Server Name Space . . . . . . . . . . . . . . . . . . . . 61
7.1. Server Exports . . . . . . . . . . . . . . . . . . . . . 61
7.2. Browsing Exports . . . . . . . . . . . . . . . . . . . . 62
7.3. Server Pseudo Filesystem . . . . . . . . . . . . . . . . 62
7.4. Multiple Roots . . . . . . . . . . . . . . . . . . . . . 63
7.5. Filehandle Volatility . . . . . . . . . . . . . . . . . 63
7.6. Exported Root . . . . . . . . . . . . . . . . . . . . . 63
7.7. Mount Point Crossing . . . . . . . . . . . . . . . . . . 63
7.8. Security Policy and Name Space Presentation . . . . . . 64
8. File Locking and Share Reservations . . . . . . . . . . . . . 65
8.1. Locking . . . . . . . . . . . . . . . . . . . . . . . . 65
8.1.1. Client ID . . . . . . . . . . . . . . . . . . . . . . 66
8.1.2. Server Release of Clientid . . . . . . . . . . . . . 69
8.1.3. lock_owner and stateid Definition . . . . . . . . . . 69
8.1.4. Use of the stateid and Locking . . . . . . . . . . . 71
8.1.5. Sequencing of Lock Requests . . . . . . . . . . . . . 73
8.1.6. Recovery from Replayed Requests . . . . . . . . . . . 74
8.1.7. Releasing lock_owner State . . . . . . . . . . . . . 74
8.1.8. Use of Open Confirmation . . . . . . . . . . . . . . 75
8.2. Lock Ranges . . . . . . . . . . . . . . . . . . . . . . 76
8.3. Upgrading and Downgrading Locks . . . . . . . . . . . . 76
8.4. Blocking Locks . . . . . . . . . . . . . . . . . . . . . 77
8.5. Lease Renewal . . . . . . . . . . . . . . . . . . . . . 77
8.6. Crash Recovery . . . . . . . . . . . . . . . . . . . . . 78
8.6.1. Client Failure and Recovery . . . . . . . . . . . . . 78
8.6.2. Server Failure and Recovery . . . . . . . . . . . . . 79
8.6.3. Network Partitions and Recovery . . . . . . . . . . . 81
8.7. Recovery from a Lock Request Timeout or Abort . . . . . 84
8.8. Server Revocation of Locks . . . . . . . . . . . . . . . 85
8.9. Share Reservations . . . . . . . . . . . . . . . . . . . 86
8.10. OPEN/CLOSE Operations . . . . . . . . . . . . . . . . . 87
8.10.1. Close and Retention of State Information . . . . . . 87
8.11. Open Upgrade and Downgrade . . . . . . . . . . . . . . . 88
8.12. Short and Long Leases . . . . . . . . . . . . . . . . . 89
8.13. Clocks, Propagation Delay, and Calculating Lease
Expiration . . . . . . . . . . . . . . . . . . . . . . . 89
8.14. Migration, Replication and State . . . . . . . . . . . . 90
8.14.1. Migration and State . . . . . . . . . . . . . . . . . 90
8.14.2. Replication and State . . . . . . . . . . . . . . . . 91
8.14.3. Notification of Migrated Lease . . . . . . . . . . . 91
8.14.4. Migration and the Lease_time Attribute . . . . . . . 92
9. Client-Side Caching . . . . . . . . . . . . . . . . . . . . . 93
9.1. Performance Challenges for Client-Side Caching . . . . . 93
9.2. Delegation and Callbacks . . . . . . . . . . . . . . . . 94
9.2.1. Delegation Recovery . . . . . . . . . . . . . . . . . 95
9.3. Data Caching . . . . . . . . . . . . . . . . . . . . . . 97
9.3.1. Data Caching and OPENs . . . . . . . . . . . . . . . 98
9.3.2. Data Caching and File Locking . . . . . . . . . . . . 99
9.3.3. Data Caching and Mandatory File Locking . . . . . . . 100
9.3.4. Data Caching and File Identity . . . . . . . . . . . 101
9.4. Open Delegation . . . . . . . . . . . . . . . . . . . . 102
9.4.1. Open Delegation and Data Caching . . . . . . . . . . 104
9.4.2. Open Delegation and File Locks . . . . . . . . . . . 105
9.4.3. Handling of CB_GETATTR . . . . . . . . . . . . . . . 106
9.4.4. Recall of Open Delegation . . . . . . . . . . . . . . 109
9.4.5. Clients that Fail to Honor Delegation Recalls . . . . 111
9.4.6. Delegation Revocation . . . . . . . . . . . . . . . . 111
9.5. Data Caching and Revocation . . . . . . . . . . . . . . 112
9.5.1. Revocation Recovery for Write Open Delegation . . . . 112
9.6. Attribute Caching . . . . . . . . . . . . . . . . . . . 113
9.7. Data and Metadata Caching and Memory Mapped Files . . . 115
9.8. Name Caching . . . . . . . . . . . . . . . . . . . . . . 117
9.9. Directory Caching . . . . . . . . . . . . . . . . . . . 118
10. Minor Versioning . . . . . . . . . . . . . . . . . . . . . . 119
11. Internationalization . . . . . . . . . . . . . . . . . . . . 122
11.1. Stringprep profile for the utf8str_cs type . . . . . . . 123
11.1.1. Intended applicability of the nfs4_cs_prep profile . 123
11.1.2. Character repertoire of nfs4_cs_prep . . . . . . . . 123
11.1.3. Mapping used by nfs4_cs_prep . . . . . . . . . . . . 123
11.1.4. Normalization used by nfs4_cs_prep . . . . . . . . . 124
11.1.5. Prohibited output for nfs4_cs_prep . . . . . . . . . 124
11.1.6. Bidirectional output for nfs4_cs_prep . . . . . . . . 124
11.2. Stringprep profile for the utf8str_cis type . . . . . . 124
11.2.1. Intended applicability of the nfs4_cis_prep profile . 125
11.2.2. Character repertoire of nfs4_cis_prep . . . . . . . . 125
11.2.3. Mapping used by nfs4_cis_prep . . . . . . . . . . . . 125
11.2.4. Normalization used by nfs4_cis_prep . . . . . . . . . 125
11.2.5. Prohibited output for nfs4_cis_prep . . . . . . . . . 125
11.2.6. Bidirectional output for nfs4_cis_prep . . . . . . . 126
11.3. Stringprep profile for the utf8str_mixed type . . . . . 126
11.3.1. Intended applicability of the nfs4_mixed_prep
profile . . . . . . . . . . . . . . . . . . . . . . . 126
11.3.2. Character repertoire of nfs4_mixed_prep . . . . . . . 126
11.3.3. Mapping used by nfs4_cis_prep . . . . . . . . . . . . 126
11.3.4. Normalization used by nfs4_mixed_prep . . . . . . . . 126
11.3.5. Prohibited output for nfs4_mixed_prep . . . . . . . . 126
11.3.6. Bidirectional output for nfs4_mixed_prep . . . . . . 127
11.4. UTF-8 Related Errors . . . . . . . . . . . . . . . . . . 127
12. Error Definitions . . . . . . . . . . . . . . . . . . . . . . 128
13. NFS version 4 Requests . . . . . . . . . . . . . . . . . . . 133
13.1. Compound Procedure . . . . . . . . . . . . . . . . . . . 133
13.2. Evaluation of a Compound Request . . . . . . . . . . . . 134
13.3. Synchronous Modifying Operations . . . . . . . . . . . . 135
13.4. Operation Values . . . . . . . . . . . . . . . . . . . . 135
14. NFS version 4 Procedures . . . . . . . . . . . . . . . . . . 135
14.1. Procedure 0: NULL - No Operation . . . . . . . . . . . . 135
14.2. Procedure 1: COMPOUND - Compound Operations . . . . . . 136
14.3. Operation 3: ACCESS - Check Access Rights . . . . . . . 139
14.4. Operation 4: CLOSE - Close File . . . . . . . . . . . . 142
14.5. Operation 5: COMMIT - Commit Cached Data . . . . . . . . 143
14.6. Operation 6: CREATE - Create a Non-Regular File Object . 146
14.7. Operation 7: DELEGPURGE - Purge Delegations Awaiting
Recovery . . . . . . . . . . . . . . . . . . . . . . . . 149
14.8. Operation 8: DELEGRETURN - Return Delegation . . . . . . 150
14.9. Operation 9: GETATTR - Get Attributes . . . . . . . . . 151
14.10. Operation 10: GETFH - Get Current Filehandle . . . . . . 153
14.11. Operation 11: LINK - Create Link to a File . . . . . . . 154
14.12. Operation 12: LOCK - Create Lock . . . . . . . . . . . . 156
14.13. Operation 13: LOCKT - Test For Lock . . . . . . . . . . 160
14.14. Operation 14: LOCKU - Unlock File . . . . . . . . . . . 162
14.15. Operation 15: LOOKUP - Lookup Filename . . . . . . . . . 164
14.16. Operation 16: LOOKUPP - Lookup Parent Directory . . . . 166
14.17. Operation 17: NVERIFY - Verify Difference in
Attributes . . . . . . . . . . . . . . . . . . . . . . . 167
14.18. Operation 18: OPEN - Open a Regular File . . . . . . . . 169
14.19. Operation 19: OPENATTR - Open Named Attribute
Directory . . . . . . . . . . . . . . . . . . . . . . . 179
14.20. Operation 20: OPEN_CONFIRM - Confirm Open . . . . . . . 181
14.21. Operation 21: OPEN_DOWNGRADE - Reduce Open File Access . 184
14.22. Operation 22: PUTFH - Set Current Filehandle . . . . . . 185
14.23. Operation 23: PUTPUBFH - Set Public Filehandle . . . . . 186
14.24. Operation 24: PUTROOTFH - Set Root Filehandle . . . . . 188
14.25. Operation 25: READ - Read from File . . . . . . . . . . 188
14.26. Operation 26: READDIR - Read Directory . . . . . . . . . 191
14.27. Operation 27: READLINK - Read Symbolic Link . . . . . . 195
14.28. Operation 28: REMOVE - Remove Filesystem Object . . . . 196
14.29. Operation 29: RENAME - Rename Directory Entry . . . . . 199
14.30. Operation 30: RENEW - Renew a Lease . . . . . . . . . . 202
14.31. Operation 31: RESTOREFH - Restore Saved Filehandle . . . 204
14.32. Operation 32: SAVEFH - Save Current Filehandle . . . . . 205
14.33. Operation 33: SECINFO - Obtain Available Security . . . 206
14.34. Operation 34: SETATTR - Set Attributes . . . . . . . . . 210
14.35. Operation 35: SETCLIENTID - Negotiate Clientid . . . . . 213
14.36. Operation 36: SETCLIENTID_CONFIRM - Confirm Clientid . . 216
14.37. Operation 37: VERIFY - Verify Same Attributes . . . . . 220
14.38. Operation 38: WRITE - Write to File . . . . . . . . . . 222
14.39. Operation 39: RELEASE_LOCKOWNER - Release Lockowner
State . . . . . . . . . . . . . . . . . . . . . . . . . 226
14.40. Operation 10044: ILLEGAL - Illegal operation . . . . . . 228
15. NFS version 4 Callback Procedures . . . . . . . . . . . . . . 228
15.1. Procedure 0: CB_NULL - No Operation . . . . . . . . . . 229
15.2. Procedure 1: CB_COMPOUND - Compound Operations . . . . . 229
15.2.7. Operation 3: CB_GETATTR - Get Attributes . . . . . . 231
15.2.8. Operation 4: CB_RECALL - Recall an Open Delegation . 232
15.2.9. Operation 10044: CB_ILLEGAL - Illegal Callback
Operation . . . . . . . . . . . . . . . . . . . . . . 234
16. Security Considerations . . . . . . . . . . . . . . . . . . . 234
17. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 236
17.1. Named Attribute Definition . . . . . . . . . . . . . . . 236
17.2. ONC RPC Network Identifiers (netids) . . . . . . . . . . 236
18. References . . . . . . . . . . . . . . . . . . . . . . . . . 238
18.1. Normative References . . . . . . . . . . . . . . . . . . 238
18.2. Informative References . . . . . . . . . . . . . . . . . 238
Appendix A. Acknowledgments . . . . . . . . . . . . . . . . . . 240
Appendix B. RFC Editor Notes . . . . . . . . . . . . . . . . . . 240
Author's Address . . . . . . . . . . . . . . . . . . . . . . . . 240
1. Introduction
1.1. Changes since RFC 3530
This document obsoletes RFC 3530 [10] as the authoritative document
describing NFSv4, without introducing any over-the-wire protocol
changes. The main changes from RFC 3530 are:
o The RPC definition has been moved to a companion document [2]
o Updates for the latest IETF intellectual property statements
o LIPKEY and SPKM-3 have been moved from being mandatory to optional
o Some clarification on a client re-establishing callback
information to the new server if state has been migrated
1.2. Changes since RFC 3010
This definition of the NFS version 4 protocol replaces or obsoletes
the definition present in [11]. While portions of the two documents
have remained the same, there have been substantive changes in
others. The changes made between [11] and this document represent
implementation experience and further review of the protocol. While
some modifications were made for ease of implementation or
clarification, most updates address errors or situations where the
definition in [11] was untenable.
The following list is not inclusive of all changes but presents some
of the most notable changes and additions made:
o The state model has added an open_owner4 identifier. This was
done to accommodate Posix based clients and the model they use for
file locking. For Posix clients, an open_owner4 would correspond
to a file descriptor potentially shared amongst a set of processes
and the lock_owner4 identifier would correspond to a process that
is locking a file.
o Clarifications and error conditions were added for the handling of
the owner and group attributes. Since these attributes are string
based (as opposed to the numeric uid/gid of previous versions of
NFS), translations may not be available; hence these changes were
made.
o Clarifications for the ACL and mode attributes to address
evaluation and partial support.
o For identifiers that are defined as XDR opaque, limits were set on
their size.
o Added the mounted_on_fileid attribute to allow Posix clients to
correctly construct local mounts.
o Modified the SETCLIENTID/SETCLIENTID_CONFIRM operations to deal
correctly with confirmation details along with adding the ability
to specify new client callback information. Also added
clarification of the callback information itself.
o Added a new operation RELEASE_LOCKOWNER to enable notifying the
server that a lock_owner4 will no longer be used by the client.
o RENEW operation changes to identify the client correctly and allow
for additional error returns.
o Verified error return possibilities for all operations.
o Removed use of the pathname4 data type from LOOKUP and OPEN in
favor of having the client construct a sequence of LOOKUP
operations to achieve the same effect.
o Clarification of the internationalization issues and adoption of
the new stringprep profile framework.
1.3. NFS Version 4 Goals
The NFS version 4 protocol is a further revision of the NFS protocol
defined already by versions 2 [12] and 3 [13]. It retains the
essential characteristics of previous versions: design for easy
recovery; independence from transport protocols, operating systems,
and filesystems; simplicity; and good performance. The NFS version 4
revision has the following goals:
o Improved access and good performance on the Internet.
The protocol is designed to transit firewalls easily, perform well
where latency is high and bandwidth is low, and scale to very
large numbers of clients per server.
o Strong security with negotiation built into the protocol.
The protocol builds on the work of the ONCRPC working group in
supporting the RPCSEC_GSS protocol. Additionally, the NFS version
4 protocol provides a mechanism to allow clients and servers the
ability to negotiate security and require clients and servers to
support a minimal set of security schemes.
o Good cross-platform interoperability.
The protocol features a filesystem model that provides a useful,
common set of features that does not unduly favor one filesystem
or operating system over another.
o Designed for protocol extensions.
The protocol is designed to accept standard extensions that do not
compromise backward compatibility.
1.4. Inconsistencies of this Document with Section 18
Section 18, RPC Definition File, contains the definitions in XDR
description language of the constructs used by the protocol. Prior
to Section 18, several of the constructs are reproduced for purposes
of explanation. The reader is warned of the possibility of errors in
the reproduced constructs outside of Section 18. For any part of the
document that is inconsistent with Section 18, Section 18 is to be
considered authoritative.
1.5. Overview of NFS version 4 Features
The major features of the NFS version 4 protocol are reviewed briefly
here to provide context both for the reader who is familiar with the
previous versions of the NFS protocol and for the reader who is new
to the NFS protocols. For the reader new to the NFS protocols,
there is still a fundamental knowledge that is expected. The reader
should be familiar with the XDR and RPC protocols as described in [3]
and [14]. A basic knowledge of filesystems and distributed
filesystems is expected as well.
1.5.1. RPC and Security
As with previous versions of NFS, the External Data Representation
(XDR) and Remote Procedure Call (RPC) mechanisms used for the NFS
version 4 protocol are those defined in [3] and [14]. To meet end to
end security requirements, the RPCSEC_GSS framework [4] will be used
to extend the basic RPC security. With the use of RPCSEC_GSS,
various mechanisms can be provided to offer authentication,
integrity, and privacy to the NFS version 4 protocol. Kerberos V5
will be used as described in [15] to provide one security framework.
The LIPKEY GSS-API mechanism described in [5] will be used to provide
for the use of user password and server public key by the NFS version
4 protocol. With the use of RPCSEC_GSS, other mechanisms may also be
specified and used for NFS version 4 security.
To enable in-band security negotiation, the NFS version 4 protocol
has added a new operation which provides the client a method of
querying the server about its policies regarding which security
mechanisms must be used for access to the server's filesystem
resources. With this, the client can securely match the security
mechanism that meets the policies specified at both the client and
server.
1.5.2. Procedure and Operation Structure
A significant departure from the previous versions of the NFS
protocol is the introduction of the COMPOUND procedure. For the NFS
version 4 protocol, there are two RPC procedures, NULL and COMPOUND.
The COMPOUND procedure is defined in terms of operations and these
operations correspond more closely to the traditional NFS procedures.
With the use of the COMPOUND procedure, the client is able to build
simple or complex requests. These COMPOUND requests allow for a
reduction in the number of RPCs needed for logical filesystem
operations. For example, without previous contact with a server a
client will be able to read data from a file in one request by
combining LOOKUP, OPEN, and READ operations in a single COMPOUND RPC.
With previous versions of the NFS protocol, this type of single
request was not possible.
The model used for COMPOUND is very simple. There is no logical OR
or ANDing of operations. The operations combined within a COMPOUND
request are evaluated in order by the server. Once an operation
returns a failing result, the evaluation ends and the results of all
evaluated operations are returned to the client.
The NFS version 4 protocol continues to have the client refer to a
file or directory at the server by a "filehandle". The COMPOUND
procedure has a method of passing a filehandle from one operation to
another within the sequence of operations. There is a concept of a
"current filehandle" and "saved filehandle". Most operations use the
"current filehandle" as the filesystem object to operate upon. The
"saved filehandle" is used as temporary filehandle storage within a
COMPOUND procedure as well as an additional operand for certain
operations.
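The following sketch (in Python, not part of the protocol
specification) illustrates this evaluation model: operations are
processed strictly in order, evaluation stops at the first failure,
and the current and saved filehandles are shared by all operations
within the request. The operation handlers and status values are
illustrative placeholders rather than the on-the-wire encoding.

NFS4_OK = 0

def evaluate_compound(operations, handlers):
    """Evaluate COMPOUND operations in order; stop at the first failure."""
    results = []
    status = NFS4_OK
    # The current and saved filehandles are shared by all operations
    # evaluated within a single COMPOUND request.
    context = {"current_fh": None, "saved_fh": None}
    for op_name, op_args in operations:
        op_status, op_result = handlers[op_name](context, op_args)
        results.append((op_name, op_status, op_result))
        if op_status != NFS4_OK:
            status = op_status
            break              # remaining operations are not evaluated
    return status, results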
1.5.3. Filesystem Model
The general filesystem model used for the NFS version 4 protocol is
the same as previous versions. The server filesystem is hierarchical
with the regular files contained within being treated as opaque byte
streams. In a slight departure, file and directory names are encoded
with UTF-8 to deal with the basics of internationalization.
The NFS version 4 protocol does not require a separate protocol to
provide for the initial mapping between path name and filehandle.
Instead of using the older MOUNT protocol for this mapping, the
server provides a ROOT filehandle that represents the logical root or
top of the filesystem tree provided by the server. The server
provides multiple filesystems by gluing them together with pseudo
filesystems. These pseudo filesystems provide for potential gaps in
the path names between real filesystems.
1.5.3.1. Filehandle Types
In previous versions of the NFS protocol, the filehandle provided by
the server was guaranteed to be valid or persistent for the lifetime
of the filesystem object to which it referred. For some server
implementations, this persistence requirement has been difficult to
meet. For the NFS version 4 protocol, this requirement has been
relaxed by introducing another type of filehandle, volatile. With
persistent and volatile filehandle types, the server implementation
can match the abilities of the filesystem at the server along with
the operating environment. The client will have knowledge of the
type of filehandle being provided by the server and can be prepared
to deal with the semantics of each.
1.5.3.2. Attribute Types
The NFS version 4 protocol introduces three classes of filesystem or
file attributes. Like the additional filehandle type, the
classification of file attributes has been done to ease server
implementations along with extending the overall functionality of the
NFS protocol. This attribute model is structured to be extensible
such that new attributes can be introduced in minor revisions of the
protocol without requiring significant rework.
The three classifications are: mandatory, recommended and named
attributes. This is a significant departure from the previous
attribute model used in the NFS protocol. Previously, the attributes
for the filesystem and file objects were a fixed set of mainly UNIX
attributes. If the server or client did not support a particular
attribute, it would have to simulate the attribute the best it could.
Mandatory attributes are the minimal set of file or filesystem
attributes that must be provided by the server and must be properly
represented by the server. Recommended attributes represent
different filesystem types and operating environments. The
recommended attributes will allow for better interoperability and the
inclusion of more operating environments. The mandatory and
recommended attribute sets are traditional file or filesystem
attributes. The third type of attribute is the named attribute. A
named attribute is an opaque byte stream that is associated with a
directory or file and referred to by a string name. Named attributes
are meant to be used by client applications as a method to associate
application specific data with a regular file or directory.
One significant addition to the recommended set of file attributes is
the Access Control List (ACL) attribute. This attribute provides for
directory and file access control beyond the model used in previous
versions of the NFS protocol. The ACL definition allows for
specification of user and group level access control.
1.5.3.3. Filesystem Replication and Migration
With the use of a special file attribute, the ability to migrate or
replicate server filesystems is enabled within the protocol. The
filesystem locations attribute provides a method for the client to
probe the server about the location of a filesystem. In the event of
a migration of a filesystem, the client will receive an error when
operating on the filesystem and it can then query as to the new
filesystem location. Similar steps are used for replication: the
client is able to query the server for the multiple available locations of a
particular filesystem. From this information, the client can use its
own policies to access the appropriate filesystem location.
1.5.4. OPEN and CLOSE
The NFS version 4 protocol introduces OPEN and CLOSE operations. The
OPEN operation provides a single point where file lookup, creation,
and share semantics can be combined. The CLOSE operation also
provides for the release of state accumulated by OPEN.
1.5.5. File locking
With the NFS version 4 protocol, the support for byte range file
locking is part of the NFS protocol. The file locking support is
structured so that an RPC callback mechanism is not required. This
is a departure from the previous versions of the NFS file locking
protocol, Network Lock Manager (NLM). The state associated with file
locks is maintained at the server under a lease-based model. The
server defines a single lease period for all state held by an NFS
client. If the client does not renew its lease within the defined
period, all state associated with the client's lease may be released
by the server. The client may renew its lease with use of the RENEW
operation or implicitly by use of other operations (primarily READ).
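The following client-side sketch (Python; the renewal margin and the
send_renew callable are illustrative assumptions, not protocol
requirements) shows one way a client might track when an explicit
RENEW is needed, given that most other operations renew the lease
implicitly.

import time

class LeaseRenewer:
    """Send an explicit RENEW only when no other operation has renewed
    the lease recently.  The 2/3 margin is an arbitrary choice."""

    def __init__(self, lease_period, send_renew):
        self.lease_period = lease_period
        self.send_renew = send_renew          # stands in for RENEW
        self.last_renewal = time.monotonic()

    def note_operation(self):
        # READ and most other stateful operations renew the lease implicitly.
        self.last_renewal = time.monotonic()

    def tick(self):
        # Called periodically; renew well before the lease can expire.
        if time.monotonic() - self.last_renewal > self.lease_period * 2 / 3:
            self.send_renew()
            self.last_renewal = time.monotonic()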
1.5.6. Client Caching and Delegation
The file, attribute, and directory caching for the NFS version 4
protocol is similar to previous versions. Attributes and directory
information are cached for a duration determined by the client. At
the end of a predefined timeout, the client will query the server to
see if the related filesystem object has been updated.
For file data, the client checks its cache validity when the file is
opened. A query is sent to the server to determine if the file has
been changed. Based on this information, the client determines if
the data cache for the file should be kept or released. Also, when the
file is closed, any modified data is written to the server.
If an application wants to serialize access to file data, file
locking of the file data ranges in question should be used.
The major addition to NFS version 4 in the area of caching is the
ability of the server to delegate certain responsibilities to the
client. When the server grants a delegation for a file to a client,
the client is guaranteed certain semantics with respect to the
sharing of that file with other clients. At OPEN, the server may
provide the client either a read or write delegation for the file.
If the client is granted a read delegation, it is assured that no
other client has the ability to write to the file for the duration of
the delegation. If the client is granted a write delegation, the
client is assured that no other client has read or write access to
the file.
Delegations can be recalled by the server. If another client
requests access to the file in such a way that the access conflicts
with the granted delegation, the server is able to notify the initial
client and recall the delegation. This requires that a callback path
exist between the server and client. If this callback path does not
exist, then delegations cannot be granted. The essence of a
delegation is that it allows the client to locally service operations
such as OPEN, CLOSE, LOCK, LOCKU, READ, and WRITE without immediate
interaction with the server.
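As a rough illustration of the conflict rules above, the following
sketch (Python; the string-valued access and delegation types are
simplifications of the protocol's share and delegation enumerations)
decides whether a new OPEN conflicts with an outstanding delegation
and whether a delegation may be offered at all.

def conflicts_with_delegation(requested_access, delegation_type):
    """Return True if a new OPEN by another client conflicts with an
    outstanding delegation ("read" or "write")."""
    if delegation_type == "read":
        # A read delegation only guarantees that no other client writes.
        return "write" in requested_access
    if delegation_type == "write":
        # A write delegation excludes all other read or write access.
        return True
    return False

def handle_open(requested_access, delegations, recall, callback_path_ok):
    """delegations maps client -> delegation type; recall() stands in
    for CB_RECALL.  A new delegation may be offered only if a callback
    path to the requesting client exists."""
    for client, delegation_type in delegations.items():
        if conflicts_with_delegation(requested_access, delegation_type):
            recall(client)
    return callback_path_ok      # True: a delegation may be granted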
1.6. General Definitions
The following definitions are provided to establish an appropriate
context for the reader.
Client The "client" is the entity that accesses the NFS server's
resources. The client may be an application which contains the
logic to access the NFS server directly. The client may also be
the traditional operating system client that provides remote
filesystem services for a set of applications.
In the case of file locking the client is the entity that
maintains a set of locks on behalf of one or more applications.
This client is responsible for crash or failure recovery for those
locks it manages.
Note that multiple clients may share the same transport and
multiple clients may exist on the same network node.
Clientid A 64-bit quantity used as a unique, short-hand reference to
a client supplied Verifier and ID. The server is responsible for
supplying the Clientid.
Lease An interval of time defined by the server for which the client
is irrevocably granted a lock. At the end of a lease period the
lock may be revoked if the lease has not been extended. The lock
must be revoked if a conflicting lock has been granted after the
lease interval.
All leases granted by a server have the same fixed interval. Note
that the fixed interval was chosen to alleviate the expense a
server would have in maintaining state about variable length
leases across server failures.
Lock The term "lock" is used to refer to both record (byte-range)
locks as well as share reservations unless specifically stated
otherwise.
Server The "Server" is the entity responsible for coordinating
client access to a set of filesystems.
Stable Storage NFS version 4 servers must be able to recover without
data loss from multiple power failures (including cascading power
failures, that is, several power failures in quick succession),
operating system failures, and hardware failure of components
other than the storage medium itself (for example, disk,
nonvolatile RAM).
Some examples of stable storage that are allowable for an NFS
server include:
1. Media commit of data, that is, the modified data has been
successfully written to the disk media, for example, the disk
platter.
2. An immediate reply disk drive with battery-backed on-drive
intermediate storage or uninterruptible power system (UPS).
3. Server commit of data with battery-backed intermediate storage
and recovery software.
4. Cache commit with uninterruptible power system (UPS) and
recovery software.
Stateid A 128-bit quantity returned by a server that uniquely
defines the open and locking state provided by the server for a
specific open or lock owner for a specific file.
Stateids composed of all bits 0 or all bits 1 have special meaning
and are reserved values.
Verifier A 64-bit quantity generated by the client that the server
can use to determine if the client has restarted and lost all
previous lock state.
2. Protocol Data Types
The syntax and semantics to describe the data types of the NFS
version 4 protocol are defined in the XDR [14] and RPC [3] documents.
The next sections build upon the XDR data types to define types and
structures specific to this protocol.
2.1. Basic Data Types
These are the base NFSv4 data types.
+---------------+---------------------------------------------------+
| Data Type | Definition |
+---------------+---------------------------------------------------+
| int32_t | typedef int int32_t; |
| uint32_t | typedef unsigned int uint32_t; |
| int64_t | typedef hyper int64_t; |
| uint64_t | typedef unsigned hyper uint64_t; |
| attrlist4 | typedef opaque attrlist4<>; |
| | Used for file/directory attributes. |
| bitmap4 | typedef uint32_t bitmap4<>; |
| | Used in attribute array encoding. |
| changeid4 | typedef uint64_t changeid4; |
| | Used in the definition of change_info4. |
| clientid4 | typedef uint64_t clientid4; |
| | Shorthand reference to client identification. |
| count4 | typedef uint32_t count4; |
| | Various count parameters (READ, WRITE, COMMIT). |
| length4 | typedef uint64_t length4; |
| | Describes LOCK lengths. |
| mode4 | typedef uint32_t mode4; |
| | Mode attribute data type. |
| nfs_cookie4 | typedef uint64_t nfs_cookie4; |
| | Opaque cookie value for READDIR. |
| nfs_fh4 | typedef opaque nfs_fh4<NFS4_FHSIZE>; |
| | Filehandle definition. |
| nfs_ftype4 | enum nfs_ftype4; |
| | Various defined file types. |
| nfsstat4 | enum nfsstat4; |
| | Return value for operations. |
| offset4 | typedef uint64_t offset4; |
| | Various offset designations (READ, WRITE, LOCK, |
| | COMMIT). |
| qop4 | typedef uint32_t qop4; |
| | Quality of protection designation in SECINFO. |
| sec_oid4 | typedef opaque sec_oid4<>; |
| | Security Object Identifier. The sec_oid4 data |
| | type is not really opaque. Instead it contains an |
| | ASN.1 OBJECT IDENTIFIER as used by GSS-API in the |
| | mech_type argument to GSS_Init_sec_context. See |
| | [6] for details. |
| seqid4 | typedef uint32_t seqid4; |
| | Sequence identifier used for file locking. |
| utf8string | typedef opaque utf8string<>; |
| | UTF-8 encoding for strings. |
| utf8str_cis | typedef utf8string utf8str_cis; |
| | Case-insensitive UTF-8 string. |
| utf8str_cs | typedef utf8string utf8str_cs; |
| | Case-sensitive UTF-8 string. |
| utf8str_mixed | typedef utf8string utf8str_mixed; |
| | UTF-8 strings with a case sensitive prefix and a |
| | case insensitive suffix. |
| component4 | typedef utf8str_cs component4; |
| | Represents path name components. |
| linktext4 | typedef utf8str_cs linktext4; |
| | Symbolic link contents. |
| pathname4 | typedef component4 pathname4<>; |
| | Represents path name for fs_locations. |
| nfs_lockid4 | typedef uint64_t nfs_lockid4; |
| verifier4 | typedef opaque verifier4[NFS4_VERIFIER_SIZE]; |
| | Verifier used for various operations (COMMIT, |
| | CREATE, OPEN, READDIR, WRITE). |
| | NFS4_VERIFIER_SIZE is defined as 8. |
+---------------+---------------------------------------------------+
End of Base Data Types
Table 1
2.2. Structured Data Types
2.2.1. nfstime4
struct nfstime4 {
int64_t seconds;
uint32_t nseconds;
};
The nfstime4 structure gives the number of seconds and nanoseconds
since midnight or 0 hour January 1, 1970 Coordinated Universal Time
(UTC). Values greater than zero for the seconds field denote dates
after the 0 hour January 1, 1970. Values less than zero for the
seconds field denote dates before the 0 hour January 1, 1970. In
both cases, the nseconds field is to be added to the seconds field
for the final time representation. For example, if the time to be
represented is one-half second before 0 hour January 1, 1970, the
seconds field would have a value of negative one (-1) and the
nseconds fields would have a value of one-half second (500000000).
Values greater than 999,999,999 for nseconds are considered invalid.
This data type is used to pass time and date information. A server
converts to and from its local representation of time when processing
time values, preserving as much accuracy as possible. If the
precision of timestamps stored for a filesystem object is less than
that defined, loss of precision can occur. An adjunct time maintenance
protocol is recommended to reduce client and server time skew.
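The following sketch (Python; a hypothetical helper, not part of the
specification) applies these encoding rules, including the
negative-time example above.

import math

def to_nfstime4(timestamp):
    """Split a fractional Unix-epoch timestamp into (seconds, nseconds).
    nseconds is always in [0, 999999999] and is added to seconds, so
    one-half second before the epoch encodes as (-1, 500000000)."""
    seconds = math.floor(timestamp)
    nseconds = int(round((timestamp - seconds) * 1000000000))
    if nseconds == 1000000000:      # rounding carried into a full second
        seconds += 1
        nseconds = 0
    return int(seconds), nseconds

def from_nfstime4(seconds, nseconds):
    """Recombine an nfstime4 value into a fractional timestamp."""
    if not 0 <= nseconds <= 999999999:
        raise ValueError("invalid nseconds")
    return seconds + nseconds / 1000000000.0

assert to_nfstime4(-0.5) == (-1, 500000000)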
2.2.2. time_how4
enum time_how4 {
SET_TO_SERVER_TIME4 = 0,
SET_TO_CLIENT_TIME4 = 1
};
2.2.3. settime4
union settime4 switch (time_how4 set_it) {
case SET_TO_CLIENT_TIME4:
nfstime4 time;
default:
void;
};
The above definitions are used as the attribute definitions to set
time values. If set_it is SET_TO_SERVER_TIME4, then the server uses
its local representation of time for the time value.
2.2.4. specdata4
struct specdata4 {
uint32_t specdata1; /* major device number */
uint32_t specdata2; /* minor device number */
};
This data type represents additional information for the device file
types NF4CHR and NF4BLK.
2.2.5. fsid4
struct fsid4 {
uint64_t major;
uint64_t minor;
};
This type is the filesystem identifier that is used as a mandatory
attribute.
2.2.6. fs_location4
struct fs_location4 {
utf8str_cis server<>;
pathname4 rootpath;
};
2.2.7. fs_locations4
struct fs_locations4 {
pathname4 fs_root;
fs_location4 locations<>;
};
The fs_location4 and fs_locations4 data types are used for the
fs_locations recommended attribute which is used for migration and
replication support.
2.2.8. fattr4
struct fattr4 {
bitmap4 attrmask;
attrlist4 attr_vals;
};
The fattr4 structure is used to represent file and directory
attributes.
The bitmap is a counted array of 32 bit integers used to contain bit
values. The position of the integer in the array that contains bit n
can be computed from the expression (n / 32) and its bit within that
integer is (n mod 32).
                  0           1
+-----------+-----------+-----------+--
|  count    | 31  ..  0 | 63  .. 32 |
+-----------+-----------+-----------+--
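A short sketch of this index arithmetic (Python; the attribute number
used in the example is arbitrary):

def set_attr_bit(bitmap, n):
    """Set bit n in a bitmap4 held as a list of 32-bit words."""
    word, bit = n // 32, n % 32
    while len(bitmap) <= word:
        bitmap.append(0)          # the array is counted, grow as needed
    bitmap[word] |= 1 << bit
    return bitmap

def attr_bit_is_set(bitmap, n):
    word, bit = n // 32, n % 32
    return word < len(bitmap) and bool(bitmap[word] & (1 << bit))

# Bit 33 lives in word 1 (33 / 32) at bit position 1 (33 mod 32).
assert set_attr_bit([], 33) == [0, 2]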
2.2.9. change_info4
struct change_info4 {
bool atomic;
changeid4 before;
changeid4 after;
};
This structure is used with the CREATE, LINK, REMOVE, and RENAME
operations to let the client know the value of the change attribute
for the directory in which the target filesystem object resides.
2.2.10. clientaddr4
struct clientaddr4 {
/* see struct rpcb in RFC 1833 */
string r_netid<>; /* network id */
string r_addr<>; /* universal address */
};
The clientaddr4 structure is used as part of the SETCLIENTID
operation to either specify the address of the client that is using a
clientid or as part of the callback registration. The r_netid and
r_addr fields are specified in [16], but they are underspecified in
[16] as far as what they should look like for specific protocols.
For TCP over IPv4 and for UDP over IPv4, the format of r_addr is the
US-ASCII string:
h1.h2.h3.h4.p1.p2
The prefix, "h1.h2.h3.h4", is the standard textual form for
representing an IPv4 address, which is always four octets long.
Assuming big-endian ordering, h1, h2, h3, and h4, are respectively,
the first through fourth octets each converted to ASCII-decimal.
Assuming big-endian ordering, p1 and p2 are, respectively, the first
and second octets each converted to ASCII-decimal. For example, if a
host, in big-endian order, has an address of 0x0A010307 and there is
a service listening on, in big endian order, port 0x020F (decimal
527), then the complete universal address is "10.1.3.7.2.15".
For TCP over IPv4 the value of r_netid is the string "tcp". For UDP
over IPv4 the value of r_netid is the string "udp".
For TCP over IPv6 and for UDP over IPv6, the format of r_addr is the
US-ASCII string:
x1:x2:x3:x4:x5:x6:x7:x8.p1.p2
The suffix "p1.p2" is the service port, and is computed the same way
as with universal addresses for TCP and UDP over IPv4. The prefix,
"x1:x2:x3:x4:x5:x6:x7:x8", is the standard textual form for
representing an IPv6 address as defined in Section 2.2 of [17].
Additionally, the two alternative forms specified in Section 2.2 of
[17] are also acceptable.
For TCP over IPv6 the value of r_netid is the string "tcp6". For UDP
over IPv6 the value of r_netid is the string "udp6".
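As an illustration only (Python; a hypothetical helper), the
following reproduces the IPv4 universal address construction
described above, including the "10.1.3.7.2.15" example.

def ipv4_universal_address(dotted_quad, port):
    """Build r_addr as "h1.h2.h3.h4.p1.p2" for TCP or UDP over IPv4."""
    p1 = (port >> 8) & 0xFF       # high-order octet of the port
    p2 = port & 0xFF              # low-order octet of the port
    return "%s.%d.%d" % (dotted_quad, p1, p2)

# Host 0x0A010307 with a service on port 0x020F (decimal 527):
assert ipv4_universal_address("10.1.3.7", 527) == "10.1.3.7.2.15"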
2.2.11. cb_client4
struct cb_client4 {
unsigned int cb_program;
clientaddr4 cb_location;
};
This structure is used by the client to inform the server of its
callback address; it includes the program number and client address.
2.2.12. nfs_client_id4
struct nfs_client_id4 {
verifier4 verifier;
opaque id<NFS4_OPAQUE_LIMIT>;
};
This structure is part of the arguments to the SETCLIENTID operation.
NFS4_OPAQUE_LIMIT is defined as 1024.
2.2.13. open_owner4
struct open_owner4 {
clientid4 clientid;
opaque owner<NFS4_OPAQUE_LIMIT>;
};
This structure is used to identify the owner of open state.
NFS4_OPAQUE_LIMIT is defined as 1024.
2.2.14. lock_owner4
struct lock_owner4 {
clientid4 clientid;
opaque owner<NFS4_OPAQUE_LIMIT>;
};
This structure is used to identify the owner of file locking state.
NFS4_OPAQUE_LIMIT is defined as 1024.
2.2.15. open_to_lock_owner4
struct open_to_lock_owner4 {
seqid4 open_seqid;
stateid4 open_stateid;
seqid4 lock_seqid;
lock_owner4 lock_owner;
};
This structure is used for the first LOCK operation done for an
open_owner4. It provides both the open_stateid and lock_owner such
that the transition is made from a valid open_stateid sequence to
that of the new lock_stateid sequence. Using this mechanism avoids
the confirmation of the lock_owner/lock_seqid pair since it is tied
to established state in the form of the open_stateid/open_seqid.
2.2.16. stateid4
struct stateid4 {
uint32_t seqid;
opaque other[12];
};
This structure is used for the various state sharing mechanisms
between the client and server. For the client, this data structure
is read-only. The starting value of the seqid field is undefined.
The server is required to increment the seqid field monotonically at
each transition of the stateid. This is important since the client
will inspect the seqid in OPEN stateids to determine the order of
OPEN processing done by the server.
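The sketch below (Python; a simplification that ignores seqid
wraparound, which a real client must handle) shows how a client might
use the seqid to pick the stateid that reflects the most recent OPEN
processing for the same state.

def newer_stateid(a, b):
    """a and b are (seqid, other) pairs for the same open state; the
    larger seqid reflects later processing by the server.  Wraparound
    of the 32-bit seqid is ignored in this sketch."""
    if a[1] != b[1]:
        raise ValueError("stateids describe different state")
    return a if a[0] >= b[0] else b

assert newer_stateid((3, b"other"), (2, b"other")) == (3, b"other")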
3. RPC and Security Flavor
The NFS version 4 protocol is a Remote Procedure Call (RPC)
application that uses RPC version 2 and the corresponding eXternal
Data Representation (XDR) as defined in [3] and [14]. The RPCSEC_GSS
security flavor as defined in [4] MUST be used as the mechanism to
deliver stronger security for the NFS version 4 protocol.
3.1. Ports and Transports
Historically, NFS version 2 and version 3 servers have resided on
port 2049. The registered port 2049 [18] for the NFS protocol should
be the default configuration. Using the registered port for NFS
services means the NFS client will not need to use the RPC binding
protocols as described in [16]; this will allow NFS to transit
firewalls.
Where an NFS version 4 implementation supports operation over the IP
network protocol, the supported transports between NFS and IP MUST be
among the IETF-approved congestion control transport protocols, which
include TCP and SCTP. To enhance the possibilities for
interoperability, an NFS version 4 implementation MUST support
operation over the TCP transport protocol, at least until such time
as a standards track RFC revises this requirement to use a different
IETF-approved congestion control transport protocol.
If TCP is used as the transport, the client and server SHOULD use
persistent connections. This will prevent the weakening of TCP's
congestion control via short lived connections and will improve
performance for the WAN environment by eliminating the need for SYN
handshakes.
As noted in the Security Considerations section, the authentication
model for NFS version 4 has moved from machine-based to principal-
based. However, this modification of the authentication model does
not imply a technical requirement to move the TCP connection
management model from whole machine-based to one based on a per user
model. In particular, NFS over TCP client implementations have
traditionally multiplexed traffic for multiple users over a common
TCP connection between an NFS client and server. This has been true,
regardless of whether the NFS client is using AUTH_SYS, AUTH_DH,
RPCSEC_GSS or any other flavor. Similarly, NFS over TCP server
implementations have assumed such a model and thus scale the
implementation of TCP connection management in proportion to the
number of expected client machines. It is intended that NFS version
4 will not modify this connection management model. NFS version 4
clients that violate this assumption can expect scaling issues on the
server and hence reduced service.
Note that for various timers, the client and server should avoid
inadvertent synchronization of those timers. For further discussion
of the general issue refer to [19].
3.1.1. Client Retransmission Behavior
When processing a request received over a reliable transport such as
TCP, the NFS version 4 server MUST NOT silently drop the request,
except if the transport connection has been broken. Given such a
contract between NFS version 4 clients and servers, clients MUST NOT
retry a request unless one or both of the following are true:
o The transport connection has been broken
o The procedure being retried is the NULL procedure
Since reliable transports, such as TCP, do not always synchronously
inform a peer when the other peer has broken the connection (for
example, when an NFS server reboots), the NFS version 4 client may
want to actively "probe" the connection to see if it has been broken.
Use of the NULL procedure is one recommended way to do so. So, when
a client experiences a remote procedure call timeout (of some
arbitrary implementation specific amount), rather than retrying the
remote procedure call, it could instead issue a NULL procedure call
to the server. If the server has died, the transport connection
break will eventually be indicated to the NFS version 4 client. The
client can then reconnect, and then retry the original request. If
the NULL procedure call gets a response, the connection has not
broken. The client can decide to wait longer for the original
request's response, or it can break the transport connection and
reconnect before re-sending the original request.
For callbacks from the server to the client, the same rules apply,
but the server doing the callback becomes the client, and the client
receiving the callback becomes the server.
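The following sketch (Python; send_request, call_null, and reconnect
are caller-supplied callables standing in for a real RPC client and
return None on timeout) illustrates the probing behavior described
above; it is not a definitive implementation.

def call_with_probe(send_request, call_null, reconnect, request):
    """On a timeout over a reliable transport, probe with the NULL
    procedure instead of blindly retransmitting the request."""
    reply = send_request(request)
    while reply is None:              # the original call timed out
        if call_null() is None:
            # The probe got no answer either; once the transport break
            # is reported, reconnect and retry the original request.
            reconnect()
        else:
            # NULL answered, so the connection is intact.  The client
            # may keep waiting; this sketch instead breaks the
            # connection and reconnects so that the retry is permitted.
            reconnect()
        reply = send_request(request)
    return reply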
3.2. Security Flavors
Traditional RPC implementations have included AUTH_NONE, AUTH_SYS,
AUTH_DH, and AUTH_KRB4 as security flavors. With [4] an additional
security flavor of RPCSEC_GSS has been introduced which uses the
functionality of GSS-API [6]. This allows for the use of various
security mechanisms by the RPC layer without the additional
implementation overhead of adding RPC security flavors. For NFS
version 4, the RPCSEC_GSS security flavor MUST be used to enable the
mandatory security mechanism. Other flavors, such as AUTH_NONE,
AUTH_SYS, and AUTH_DH MAY be implemented as well.
3.2.1. Security mechanisms for NFS version 4
The use of RPCSEC_GSS requires selection of mechanism, quality of
protection, and service (authentication, integrity, privacy). The
remainder of this document will refer to these three parameters of
the RPCSEC_GSS security as the security triple.
3.2.1.1. Kerberos V5 as a security triple
The Kerberos V5 GSS-API mechanism as described in [15] MUST be
implemented and provide the following security triples.
column descriptions:
1 == number of pseudo flavor
2 == name of pseudo flavor
3 == mechanism's OID
4 == mechanism's algorithm(s)
5 == RPCSEC_GSS service
1      2     3                    4               5
--------------------------------------------------------------------
390003 krb5  1.2.840.113554.1.2.2 DES MAC MD5     rpc_gss_svc_none
390004 krb5i 1.2.840.113554.1.2.2 DES MAC MD5     rpc_gss_svc_integrity
390005 krb5p 1.2.840.113554.1.2.2 DES MAC MD5     rpc_gss_svc_privacy
                                  for integrity,
                                  and 56 bit DES
                                  for privacy.
Note that the pseudo flavor is presented here as a mapping aid to the
implementor. Because this NFS protocol includes a method to
negotiate security and it understands the GSS-API mechanism, the
pseudo flavor is not needed. The pseudo flavor is needed for NFS
version 3 since the security negotiation is done via the MOUNT
protocol.
For a discussion of NFS' use of RPCSEC_GSS and Kerberos V5, please
see [20].
Users and implementors are warned that 56 bit DES is no longer
considered state of the art in terms of resistance to brute force
attacks. Once a revision to [15] is available that adds support for
AES, implementors are urged to incorporate AES into their NFSv4 over
Kerberos V5 protocol stacks, and users are similarly urged to migrate
to the use of AES.
3.2.1.2. LIPKEY as a security triple
The LIPKEY GSS-API mechanism as described in [5] MAY be implemented
and provide the following security triples. The definition of the
columns matches the previous subsection "Kerberos V5 as security
triple".
1 2 3 4 5
--------------------------------------------------------------------
390006 lipkey 1.3.6.1.5.5.9 negotiated rpc_gss_svc_none
390007 lipkey-i 1.3.6.1.5.5.9 negotiated rpc_gss_svc_integrity
390008 lipkey-p 1.3.6.1.5.5.9 negotiated rpc_gss_svc_privacy
The mechanism algorithm is listed as "negotiated". This is because
LIPKEY is layered on SPKM-3 and in SPKM-3 [5] the confidentiality and
integrity algorithms are negotiated. Since SPKM-3 specifies HMAC-MD5
for integrity as MANDATORY, 128 bit cast5CBC for confidentiality
(privacy) as MANDATORY, and further specifies that HMAC-MD5 and
cast5CBC MUST be listed first before weaker algorithms, specifying
"negotiated" in column 4 does not impair interoperability. In the
event an SPKM-3 peer does not support the mandatory algorithms, the
other peer is free to accept or reject the GSS-API context creation.
Because SPKM-3 negotiates the algorithms, subsequent calls to
LIPKEY's GSS_Wrap() and GSS_GetMIC() by RPCSEC_GSS will use a quality
of protection value of 0 (zero). See section 5.2 of [21] for an
explanation.
LIPKEY uses SPKM-3 to create a secure channel in which to pass a user
name and password from the client to the server. Once the user name
and password have been accepted by the server, calls to the LIPKEY
context are redirected to the SPKM-3 context. See [5] for more
details.
3.2.1.3. SPKM-3 as a security triple
The SPKM-3 GSS-API mechanism as described in [5] MAY be implemented
and provide the following security triples. The definition of the
columns matches the previous subsection "Kerberos V5 as security
triple".
1 2 3 4 5
--------------------------------------------------------------------
390009 spkm3 1.3.6.1.5.5.1.3 negotiated rpc_gss_svc_none
390010 spkm3i 1.3.6.1.5.5.1.3 negotiated rpc_gss_svc_integrity
390011 spkm3p 1.3.6.1.5.5.1.3 negotiated rpc_gss_svc_privacy
For a discussion as to why the mechanism algorithm is listed as
"negotiated", see Section 3.2.1.2 "LIPKEY as a security triple."
Because SPKM-3 negotiates the algorithms, subsequent calls to SPKM-
3's GSS_Wrap() and GSS_GetMIC() by RPCSEC_GSS will use a quality of
protection value of 0 (zero). See section 5.2 of [21] for an
explanation.
Even though LIPKEY is layered over SPKM-3, SPKM-3 is specified as a
mandatory set of triples to handle the situations where the initiator
(the client) is anonymous or where the initiator has its own
certificate. If the initiator is anonymous, there will not be a user
name and password to send to the target (the server). If the
initiator has its own certificate, then using passwords is
superfluous.
3.3. Security Negotiation
With the NFS version 4 server potentially offering multiple security
mechanisms, the client needs a method to determine or negotiate which
mechanism is to be used for its communication with the server. The
NFS server may have multiple points within its filesystem name space
that are available for use by NFS clients. In turn the NFS server
may be configured such that each of these entry points may have
different or multiple security mechanisms in use.
The security negotiation between client and server must be done with
a secure channel to eliminate the possibility of a third party
intercepting the negotiation sequence and forcing the client and
server to choose a lower level of security than required or desired.
See Section 16 "Security Considerations" for further discussion.
3.3.1. SECINFO
The new SECINFO operation will allow the client to determine, on a
per filehandle basis, what security triple is to be used for server
access. In general, the client will not have to use the SECINFO
operation except during initial communication with the server or when
the client crosses policy boundaries at the server. It is possible
that the server's policies change during the client's interaction,
thereby forcing the client to negotiate a new security triple.
3.3.2. Security Error
Based on the assumption that each NFS version 4 client and server
must support a minimum set of security (i.e., LIPKEY, SPKM-3, and
Kerberos-V5 all under RPCSEC_GSS), the NFS client will start its
communication with the server with one of the minimal security
triples. During communication with the server, the client may
receive an NFS error of NFS4ERR_WRONGSEC. This error allows the
server to notify the client that the security triple currently being
used is not appropriate for access to the server's filesystem
resources. The client is then responsible for determining what
security triples are available at the server and choosing one which
is appropriate for the client. See Section 14.33, the "SECINFO"
operation, for further discussion of how the client will respond to
the NFS4ERR_WRONGSEC error and use SECINFO.
3.3.3. Callback RPC Authentication
Except as noted elsewhere in this section, the callback RPC
(described later) MUST mutually authenticate the NFS server to the
principal that acquired the clientid (also described later), using
the security flavor the original SETCLIENTID operation used.
For AUTH_NONE, there are no principals, so this is a non-issue.
AUTH_SYS has no notions of mutual authentication or a server
principal, so the callback from the server simply uses the AUTH_SYS
credential that the user used when setting up the delegation.
For AUTH_DH, one commonly used convention is that the server uses the
credential corresponding to this AUTH_DH principal:
unix.host@domain
where host and domain are variables corresponding to the name of
server host and directory services domain in which it lives such as a
Network Information System domain or a DNS domain.
Because LIPKEY is layered over SPKM-3, it is permissible for the
server to use SPKM-3 and not LIPKEY for the callback even if the
client used LIPKEY for SETCLIENTID.
Regardless of what security mechanism under RPCSEC_GSS is being used,
the NFS server MUST identify itself in GSS-API via a
GSS_C_NT_HOSTBASED_SERVICE name type. GSS_C_NT_HOSTBASED_SERVICE
names are of the form:
service@hostname
For NFS, the "service" element is
nfs
Implementations of security mechanisms will convert nfs@hostname to
various different forms. For Kerberos V5 and LIPKEY, the following
form is RECOMMENDED:
nfs/hostname
For Kerberos V5, nfs/hostname would be a server principal in the
Kerberos Key Distribution Center database. This is the same
principal the client acquired a GSS-API context for when it issued
the SETCLIENTID operation, therefore, the realm name for the server
principal must be the same for the callback as it was for the
SETCLIENTID.
For LIPKEY, this would be the username passed to the target (the NFS
version 4 client that receives the callback).
It should be noted that LIPKEY may not work for callbacks, since the
LIPKEY client uses a user id/password. If the NFS client receiving
the callback can authenticate the NFS server's user name/password
pair, and if the user that the NFS server is authenticating to has a
public key certificate, then the callback can succeed.
In situations where the NFS client uses LIPKEY and uses a per-host
principal for the SETCLIENTID operation, instead of using LIPKEY for
SETCLIENTID, it is RECOMMENDED that SPKM-3 with mutual authentication
be used. This effectively means that the client will use a
certificate to authenticate and identify the initiator to the target
on the NFS server. Using SPKM-3 and not LIPKEY has the following
advantages:
o When the server does a callback, it must authenticate to the
principal used in the SETCLIENTID. Even if LIPKEY is used,
because LIPKEY is layered over SPKM-3, the NFS client will need to
have a certificate that corresponds to the principal used in the
SETCLIENTID operation. From an administrative perspective, having
a user name, password, and certificate for both the client and
server is redundant.
o LIPKEY was intended to minimize additional infrastructure
requirements beyond a certificate for the target, and the
expectation is that existing password infrastructure can be
leveraged for the initiator. In some environments, a per-host
password does not exist yet. If certificates are used for any
per-host principals, then additional password infrastructure is
not needed.
o In cases when a host is both an NFS client and server, it can
share the same per-host certificate.
4. Filehandles
The filehandle in the NFS protocol is a per server unique identifier
for a filesystem object. The contents of the filehandle are opaque
to the client. Therefore, the server is responsible for translating
the filehandle to an internal representation of the filesystem
object.
4.1. Obtaining the First Filehandle
The operations of the NFS protocol are defined in terms of one or
more filehandles. Therefore, the client needs a filehandle to
initiate communication with the server. With the NFS version 2
protocol [12] and the NFS version 3 protocol [13], there exists an
ancillary protocol to obtain this first filehandle. The MOUNT
protocol, RPC program number 100005, provides the mechanism of
translating a string based filesystem path name to a filehandle which
can then be used by the NFS protocols.
The MOUNT protocol has deficiencies in the area of security and use
via firewalls. This is one reason that the use of the public
filehandle was introduced in [22] and [23]. With the use of the
public filehandle in combination with the LOOKUP operation in the NFS
version 2 and 3 protocols, it has been demonstrated that the MOUNT
protocol is unnecessary for viable interaction between NFS client and
server.
Therefore, the NFS version 4 protocol will not use an ancillary
protocol for translation from string based path names to a
filehandle. Two special filehandles will be used as starting points
for the NFS client.
4.1.1. Root Filehandle
The first of the special filehandles is the ROOT filehandle. The
ROOT filehandle is the "conceptual" root of the filesystem name space
at the NFS server. The client uses or starts with the ROOT
filehandle by employing the PUTROOTFH operation. The PUTROOTFH
operation instructs the server to set the "current" filehandle to the
ROOT of the server's file tree. Once this PUTROOTFH operation is
used, the client can then traverse the entirety of the server's file
tree with the LOOKUP operation. A complete discussion of the server
name space is in the section "NFS Server Name Space".
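As a non-normative illustration, a client that needs the filehandle
for the path /export/home (a hypothetical path used only for this
example) could traverse from the ROOT filehandle with a single
COMPOUND request:

   PUTROOTFH            ; current filehandle becomes the server root
   LOOKUP "export"
   LOOKUP "home"
   GETFH                ; return the resulting filehandle to the client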
4.1.2. Public Filehandle
The second special filehandle is the PUBLIC filehandle. Unlike the
ROOT filehandle, the PUBLIC filehandle may be bound to or represent
an arbitrary filesystem object at the server. The server is
responsible
for this binding. It may be that the PUBLIC filehandle and the ROOT
filehandle refer to the same filesystem object. However, it is up to
the administrative software at the server and the policies of the
server administrator to define the binding of the PUBLIC filehandle
and server filesystem object. The client may not make any
assumptions about this binding. The client uses the PUBLIC
filehandle via the PUTPUBFH operation.
4.2. Filehandle Types
In the NFS version 2 and 3 protocols, there was one type of
filehandle with a single set of semantics. This type of filehandle
is termed "persistent" in NFS Version 4. The semantics of a
persistent filehandle remain the same as before. A new type of
filehandle introduced in NFS Version 4 is the "volatile" filehandle,
which attempts to accommodate certain server environments.
The volatile filehandle type was introduced to address server
functionality or implementation issues which make correct
implementation of a persistent filehandle infeasible. Some server
environments do not provide a filesystem level invariant that can be
used to construct a persistent filehandle. The underlying server
filesystem may not provide the invariant or the server's filesystem
programming interfaces may not provide access to the needed
invariant. Volatile filehandles may ease the implementation of
server functionality such as hierarchical storage management or
filesystem reorganization or migration. However, the volatile
filehandle increases the implementation burden for the client.
Since the client will need to handle persistent and volatile
filehandles differently, a file attribute is defined which may be
used by the client to determine the filehandle types being returned
by the server.
4.2.1. General Properties of a Filehandle
The filehandle contains all the information the server needs to
distinguish an individual file. To the client, the filehandle is
opaque. The client stores filehandles for use in a later request and
can compare two filehandles from the same server for equality by
doing a byte-by-byte comparison. However, the client MUST NOT
otherwise interpret the contents of filehandles. If two filehandles
from the same server are equal, they MUST refer to the same file.
Servers SHOULD try to maintain a one-to-one correspondence between
filehandles and files but this is not required. Clients MUST use
filehandle comparisons only to improve performance, not for correct
behavior. All clients need to be prepared for situations in which it
cannot be determined whether two filehandles denote the same object
and in such cases, avoid making invalid assumptions which might cause
incorrect behavior. Further discussion of filehandle and attribute
comparison in the context of data caching is presented in the section
"Data Caching and File Identity".
As an example, in the case that two different path names when
traversed at the server terminate at the same filesystem object, the
server SHOULD return the same filehandle for each path. This can
occur if a hard link is used to create two file names which refer to
the same underlying file object and associated data. For example, if
paths /a/b/c and /a/d/c refer to the same file, the server SHOULD
return the same filehandle for both path name traversals.
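The only comparison a client may perform is the byte-by-byte equality
test described above. The following C fragment is a purely
illustrative sketch of such a test; the nfs_fh4 representation shown
(a length plus an opaque array of at most NFS4_FHSIZE octets) is an
assumption made for the example, not a requirement on client data
structures.

   #include <string.h>

   #define NFS4_FHSIZE 128   /* maximum filehandle size in octets */

   struct nfs_fh4 {
       unsigned int  len;                 /* significant octets      */
       unsigned char data[NFS4_FHSIZE];   /* opaque filehandle bytes */
   };

   /* Returns 1 only when the filehandles are byte-for-byte equal.
    * Equal filehandles from one server MUST denote the same object;
    * unequal filehandles prove nothing about object identity. */
   static int fh_equal(const struct nfs_fh4 *a, const struct nfs_fh4 *b)
   {
       return a->len == b->len && memcmp(a->data, b->data, a->len) == 0;
   }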
4.2.2. Persistent Filehandle
A persistent filehandle is defined as having a fixed value for the
lifetime of the filesystem object to which it refers. Once the
server creates the filehandle for a filesystem object, the server
MUST accept the same filehandle for the object for the lifetime of
the object. If the server restarts or reboots, the NFS server must
honor the same filehandle value as it did in the server's previous
instantiation. Similarly, if the filesystem is migrated, the new NFS
server must honor the same filehandle as the old NFS server.
The persistent filehandle will become stale or invalid when the
filesystem object is removed. When the server is presented with a
persistent filehandle that refers to a deleted object, it MUST return
an error of NFS4ERR_STALE. A filehandle may become stale when the
filesystem containing the object is no longer available. The file
system may become unavailable if it exists on removable media and the
media is no longer available at the server or the filesystem in whole
has been destroyed or the filesystem has simply been removed from the
server's name space (i.e., unmounted in a UNIX environment).
4.2.3. Volatile Filehandle
A volatile filehandle does not share the same longevity
characteristics of a persistent filehandle. The server may determine
that a volatile filehandle is no longer valid at many different
points in time. If the server can definitively determine that a
volatile filehandle refers to an object that has been removed, the
server should return NFS4ERR_STALE to the client (as is the case for
persistent filehandles). In all other cases where the server
determines that a volatile filehandle can no longer be used, it
should return an error of NFS4ERR_FHEXPIRED.
The mandatory attribute "fh_expire_type" is used by the client to
determine what type of filehandle the server is providing for a
particular filesystem. This attribute is a bitmask with the
following values:
FH4_PERSISTENT The value of FH4_PERSISTENT is used to indicate a
persistent filehandle, which is valid until the object is removed
from the filesystem. The server will not return NFS4ERR_FHEXPIRED
for this filehandle. FH4_PERSISTENT is defined as a value in
which none of the bits specified below are set.
FH4_VOLATILE_ANY The filehandle may expire at any time, except as
specifically excluded (i.e., FH4_NOEXPIRE_WITH_OPEN).
FH4_NOEXPIRE_WITH_OPEN May only be set when FH4_VOLATILE_ANY is set.
If this bit is set, then the meaning of FH4_VOLATILE_ANY is
qualified to exclude any expiration of the filehandle when it is
open.
FH4_VOL_MIGRATION The filehandle will expire as a result of
migration. If FH4_VOLATILE_ANY is set, FH4_VOL_MIGRATION is redundant.
FH4_VOL_RENAME The filehandle will expire during rename. This
includes a rename by the requesting client or a rename by any
other client. If FH4_VOLATILE_ANY is set, FH4_VOL_RENAME is redundant.
Servers which provide volatile filehandles that may expire while open
(i.e., if FH4_VOL_MIGRATION or FH4_VOL_RENAME is set or if
FH4_VOLATILE_ANY is set and FH4_NOEXPIRE_WITH_OPEN not set), should
deny a RENAME or REMOVE that would affect an OPEN file of any of the
components leading to the OPEN file. In addition, the server should
deny all RENAME or REMOVE requests during the grace period upon
server restart.
Note that the bits FH4_VOL_MIGRATION and FH4_VOL_RENAME allow the
client to determine that expiration has occurred whenever a specific
event occurs, without an explicit filehandle expiration error from
the server. FH4_VOLATILE_ANY does not provide this form of information.
In situations where the server will expire many, but not all
filehandles upon migration (e.g., all but those that are open),
FH4_VOLATILE_ANY (in this case with FH4_NOEXPIRE_WITH_OPEN) is a
better choice since the client may not assume that all filehandles
will expire when migration occurs, and it is likely that additional
expirations will occur (as a result of file CLOSE) that are separated
in time from the migration event itself.
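The bit values below are those assigned to the fh_expire_type
attribute in the protocol's XDR definition. The C fragment that
follows is a simplified, non-normative sketch of how a client might
decide whether a cached filehandle is subject to expiration; a real
client would also retain the pathname components needed for the
recovery described in Section 4.3.

   /* fh_expire_type bit values from the protocol definition. */
   #define FH4_PERSISTENT          0x00000000
   #define FH4_NOEXPIRE_WITH_OPEN  0x00000001
   #define FH4_VOLATILE_ANY        0x00000002
   #define FH4_VOL_MIGRATION       0x00000004
   #define FH4_VOL_RENAME          0x00000008

   /* Sketch: may this filehandle expire, given the filesystem's
    * fh_expire_type and whether this client has the file open? */
   static int fh_may_expire(unsigned int expire_type, int file_is_open)
   {
       if (expire_type == FH4_PERSISTENT)
           return 0;                       /* never expires            */
       if ((expire_type & FH4_VOLATILE_ANY) &&
           (expire_type & FH4_NOEXPIRE_WITH_OPEN) &&
           file_is_open)
           return 0;                       /* excluded while open      */
       return 1;                           /* volatile; be prepared    */
   }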
4.2.4. One Method of Constructing a Volatile Filehandle
A volatile filehandle, while opaque to the client could contain:
[volatile bit = 1 | server boot time | slot | generation number]
o slot is an index in the server volatile filehandle table
o generation number is the generation number for the table entry/
slot
When the client presents a volatile filehandle, the server makes the
following checks, which assume that the check for the volatile bit
has passed. If the server boot time in the filehandle is less than
the current server boot time, return NFS4ERR_FHEXPIRED. If the slot
is out of range, return
NFS4ERR_BADHANDLE. If the generation number does not match, return
NFS4ERR_FHEXPIRED.
When the server reboots, the table is gone (it is volatile).
If volatile bit is 0, then it is a persistent filehandle with a
different structure following it.
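A non-normative C sketch of this construction and of the validation
steps just described follows. The field widths, the error code
values, and the table representation are assumptions made for the
example; the protocol only requires that the filehandle be opaque to
the client.

   #include <stdint.h>

   /* One possible internal layout; nothing in the protocol requires it. */
   struct volatile_fh {
       uint32_t volatile_bit;   /* 1 = volatile form, 0 = persistent   */
       uint32_t boot_time;      /* server boot time when fh was issued */
       uint32_t slot;           /* index into the volatile fh table    */
       uint32_t generation;     /* generation number for that slot     */
   };

   #define NFS4_OK            0
   #define NFS4ERR_BADHANDLE  10001
   #define NFS4ERR_FHEXPIRED  10014

   static int check_volatile_fh(const struct volatile_fh *fh,
                                uint32_t current_boot_time,
                                const uint32_t *generation_table,
                                uint32_t table_size)
   {
       /* The caller has already verified fh->volatile_bit == 1. */
       if (fh->boot_time < current_boot_time)
           return NFS4ERR_FHEXPIRED;    /* table was lost at reboot */
       if (fh->slot >= table_size)
           return NFS4ERR_BADHANDLE;
       if (generation_table[fh->slot] != fh->generation)
           return NFS4ERR_FHEXPIRED;    /* slot has been reused     */
       return NFS4_OK;
   }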
4.3. Client Recovery from Filehandle Expiration
If possible, the client SHOULD recover from the receipt of an
NFS4ERR_FHEXPIRED error. The client must take on additional
responsibility so that it may prepare itself to recover from the
expiration of a volatile filehandle. If the server returns
persistent filehandles, the client does not need these additional
steps.
For volatile filehandles, most commonly the client will need to store
the component names leading up to and including the filesystem object
in question. With these names, the client should be able to recover
by finding a filehandle in the name space that is still available or
by starting at the root of the server's filesystem name space.
If the expired filehandle refers to an object that has been removed
from the filesystem, obviously the client will not be able to recover
from the expired filehandle.
It is also possible that the expired filehandle refers to a file that
has been renamed. If the file was renamed by another client, again
it is possible that the original client will not be able to recover.
However, in the case that the client itself is renaming the file and
the file is open, it is possible that the client may be able to
recover. The client can determine the new path name based on the
processing of the rename request. The client can then regenerate the
new filehandle based on the new path name. The client could also use
the compound operation mechanism to construct a set of operations
like:
RENAME A B
LOOKUP B
GETFH
Note that the COMPOUND procedure does not provide atomicity. This
example only reduces the overhead of recovering from an expired
filehandle.
5. File Attributes
To meet the requirements of extensibility and increased
interoperability with non-UNIX platforms, attributes must be handled
in a flexible manner. The NFS version 3 fattr3 structure contains a
fixed list of attributes that not all clients and servers are able to
support or care about. The fattr3 structure cannot be extended as
new needs arise and it provides no way to indicate non-support. With
the NFS version 4 protocol, the client is able to query what
attributes
the server supports and construct requests with only those supported
attributes (or a subset thereof).
To this end, attributes are divided into three groups: mandatory,
recommended, and named. Both mandatory and recommended attributes
are supported in the NFS version 4 protocol by a specific and well-
defined encoding and are identified by number. They are requested by
setting a bit in the bit vector sent in the GETATTR request; the
server response includes a bit vector to list what attributes were
returned in the response. New mandatory or recommended attributes
may be added to the NFS protocol between major revisions by
publishing a standards-track RFC which allocates a new attribute
number value and defines the encoding for the attribute. See
Section 10 "Minor Versioning" for further discussion.
Named attributes are accessed by the new OPENATTR operation, which
accesses a hidden directory of attributes associated with a file
system object. OPENATTR takes a filehandle for the object and
returns the filehandle for the attribute hierarchy. The filehandle
for the named attributes is a directory object accessible by LOOKUP
or READDIR and contains files whose names represent the named
attributes and whose data bytes are the value of the attribute. For
example:
LOOKUP "foo" ; look up file
GETATTR attrbits
OPENATTR ; access foo's named attributes
LOOKUP "x11icon" ; look up specific attribute
READ 0,4096 ; read stream of bytes
Named attributes are intended for data needed by applications rather
than by an NFS client implementation. NFS implementors are strongly
encouraged to define their new attributes as recommended attributes
by bringing them to the IETF standards-track process.
The set of attributes which are classified as mandatory is
deliberately small since servers must do whatever it takes to support
them. A server should support as many of the recommended attributes
as possible but by their definition, the server is not required to
support all of them. Attributes are deemed mandatory if the data is
both needed by a large number of clients and is not otherwise
reasonably computable by the client when support is not provided on
the server.
Note that the hidden directory returned by OPENATTR is a convenience
for protocol processing. The client should not make any assumptions
about the server's implementation of named attributes and whether the
underlying filesystem at the server has a named attribute directory
or not. Therefore, operations such as SETATTR and GETATTR on the
named attribute directory are undefined.
5.1. Mandatory Attributes
These MUST be supported by every NFS version 4 client and server in
order to ensure a minimum level of interoperability. The server must
store and return these attributes and the client must be able to
function with an attribute set limited to these attributes. With
just the mandatory attributes some client functionality may be
impaired or limited in some ways. A client may ask for any of these
attributes to be returned by setting a bit in the GETATTR request and
the server must return their value.
5.2. Recommended Attributes
These attributes are understood well enough to warrant support in the
NFS version 4 protocol. However, they may not be supported on all
clients and servers. A client may ask for any of these attributes to
be returned by setting a bit in the GETATTR request but must handle
the case where the server does not return them. A client may ask for
the set of attributes the server supports and should not request
attributes the server does not support. A server should be tolerant
of requests for unsupported attributes and simply not return them
rather than considering the request an error. It is expected that
servers will support all attributes they comfortably can and only
fail to support attributes which are difficult to support in their
operating environments. A server should provide an attribute
whenever it does not have to "tell lies" to the client. For example,
a file modification time should be either an accurate time or should
not be supported by the server. This will not always be comfortable
to clients but the client is better positioned to decide whether and
how to fabricate or construct an attribute or whether to do without
the
attribute.
5.3. Named Attributes
These attributes are not supported by direct encoding in the NFS
Version 4 protocol but are accessed by string names rather than
numbers and correspond to an uninterpreted stream of bytes which are
stored with the filesystem object. The name space for these
attributes may be accessed by using the OPENATTR operation. The
OPENATTR operation returns a filehandle for a virtual "attribute
directory" and further perusal of the name space may be done using
READDIR and LOOKUP operations on this filehandle. Named attributes
may then be examined or changed by normal READ and WRITE and CREATE
operations on the filehandles returned from READDIR and LOOKUP.
Named attributes may have attributes.
It is recommended that servers support arbitrary named attributes. A
client should not depend on the ability to store any named attributes
in the server's filesystem. If a server does support named
attributes, a client which is also able to handle them should be able
to copy a file's data and meta-data with complete transparency from
one location to another; this would imply that names allowed for
regular directory entries are valid for named attribute names as
well.
Names of attributes will not be controlled by this document or other
IETF standards track documents. See Section 17 "IANA Considerations"
for further discussion.
5.4. Classification of Attributes
Each of the Mandatory and Recommended attributes can be classified in
one of three categories: per server, per filesystem, or per
filesystem object. Note that it is possible that some per filesystem
attributes may vary within the filesystem. See the "homogeneous"
attribute for its definition. Note that the attributes
time_access_set and time_modify_set are not listed in this section
because they are write-only attributes corresponding to time_access
and time_modify, and are used in a special instance of SETATTR.
o The per server attribute is:
lease_time
o The per filesystem attributes are:
supp_attr, fh_expire_type, link_support, symlink_support,
unique_handles, aclsupport, cansettime, case_insensitive,
case_preserving, chown_restricted, files_avail, files_free,
files_total, fs_locations, homogeneous, maxfilesize, maxname,
maxread, maxwrite, no_trunc, space_avail, space_free, space_total,
time_delta
o The per filesystem object attributes are:
type, change, size, named_attr, fsid, rdattr_error, filehandle,
ACL, archive, fileid, hidden, maxlink, mimetype, mode, numlinks,
owner, owner_group, rawdev, space_used, system, time_access,
time_backup, time_create, time_metadata, time_modify,
mounted_on_fileid
For quota_avail_hard, quota_avail_soft, and quota_used see their
definitions below for the appropriate classification.
5.5. Mandatory Attributes - Definitions
+-----------------+----+------------+--------+----------------------+
| Name | Id | Data Type | Access | Description |
+-----------------+----+------------+--------+----------------------+
| supp_attr | 0 | bitmap | READ | The bit vector which |
| | | | | would retrieve all |
| | | | | mandatory and |
| | | | | recommended |
| | | | | attributes that are |
| | | | | supported for this |
| | | | | object. The scope of |
| | | | | this attribute |
| | | | | applies to all |
| | | | | objects with a |
| | | | | matching fsid. |
| type | 1 | nfs4_ftype | READ | The type of the |
| | | | | object (file, |
| | | | | directory, symlink, |
| | | | | etc.) |
| fh_expire_type | 2 | uint32 | READ | Server uses this to |
| | | | | specify filehandle |
| | | | | expiration behavior |
| | | | | to the client. See |
| | | | | Section 4 |
| | | | | "Filehandles" for |
| | | | | additional |
| | | | | description. |
| change | 3 | uint64 | READ | A value created by |
| | | | | the server that the |
| | | | | client can use to |
| | | | | determine if file |
| | | | | data, directory |
| | | | | contents or |
| | | | | attributes of the |
| | | | | object have been |
| | | | | modified. The server |
| | | | | may return the |
| | | | | object's |
| | | | | time_metadata |
| | | | | attribute for this |
| | | | | attribute's value |
| | | | | but only if the |
| | | | | filesystem object |
| | | | | can not be updated |
| | | | | more frequently than |
| | | | | the resolution of |
| | | | | time_metadata. |
| size | 4 | uint64 | R/W | The size of the |
| | | | | object in bytes. |
| link_support | 5 | bool | READ | True, if the |
| | | | | object's filesystem |
| | | | | supports hard links. |
| symlink_support | 6 | bool | READ | True, if the |
| | | | | object's filesystem |
| | | | | supports symbolic |
| | | | | links. |
| named_attr | 7 | bool | READ | True, if this object |
| | | | | has named |
| | | | | attributes. In other |
| | | | | words, object has a |
| | | | | non-empty named |
| | | | | attribute directory. |
| fsid | 8 | fsid4 | READ | Unique filesystem |
| | | | | identifier for the |
| | | | | filesystem holding |
| | | | | this object. fsid |
| | | | | contains major and |
| | | | | minor components |
| | | | | each of which are |
| | | | | uint64. |
| unique_handles | 9 | bool | READ | True, if two |
| | | | | distinct filehandles |
| | | | | are guaranteed to |
| | | | | refer to two |
| | | | | different filesystem |
| | | | | objects. |
| lease_time | 10 | nfs_lease4 | READ | Duration of leases |
| | | | | at server in |
| | | | | seconds. |
| rdattr_error | 11 | enum | READ | Error returned from |
| | | | | getattr during |
| | | | | readdir. |
| filehandle | 19 | nfs_fh4 | READ | The filehandle of |
| | | | | this object |
| | | | | (primarily for |
| | | | | readdir requests). |
+-----------------+----+------------+--------+----------------------+
Table 2
5.6. Recommended Attributes - Definitions
+-------------------+----+--------------+--------+------------------+
| Name | Id | Data Type | Access | Description |
+-------------------+----+--------------+--------+------------------+
| ACL | 12 | nfsace4<> | R/W | The access |
| | | | | control list for |
| | | | | the object. |
| aclsupport | 13 | uint32 | READ | Indicates what |
| | | | | types of ACLs |
| | | | | are supported on |
| | | | | the current |
| | | | | filesystem. |
| archive | 14 | bool | R/W | True, if this |
| | | | | file has been |
| | | | | archived since |
| | | | | the time of last |
| | | | | modification |
| | | | | (deprecated in |
| | | | | favor of |
| | | | | time_backup). |
| cansettime | 15 | bool | READ | True, if the |
| | | | | server is able |
| | | | | to change the |
| | | | | times for a |
| | | | | filesystem |
| | | | | object as |
| | | | | specified in a |
| | | | | SETATTR |
| | | | | operation. |
| case_insensitive | 16 | bool | READ | True, if |
| | | | | filename |
| | | | | comparisons on |
| | | | | this filesystem |
| | | | | are case |
| | | | | insensitive. |
| case_preserving | 17 | bool | READ | True, if |
| | | | | filename case on |
| | | | | this filesystem |
| | | | | is preserved. |
| chown_restricted | 18 | bool | READ | If TRUE, the |
| | | | | server will |
| | | | | reject any |
| | | | | request to |
| | | | | change either |
| | | | | the owner or the |
| | | | | group associated |
| | | | | with a file if |
| | | | | the caller is |
| | | | | not a privileged |
| | | | | user (for |
| | | | | example, "root" |
| | | | | in UNIX |
| | | | | operating |
| | | | | environments or |
| | | | | in Windows 2000 |
| | | | | the "Take |
| | | | | Ownership" |
| | | | | privilege). |
| fileid | 20 | uint64 | READ | A number |
| | | | | uniquely |
| | | | | identifying the |
| | | | | file within the |
| | | | | filesystem. |
| files_avail | 21 | uint64 | READ | File slots |
| | | | | available to |
| | | | | this user on the |
| | | | | filesystem |
| | | | | containing this |
| | | | | object - this |
| | | | | should be the |
| | | | | smallest |
| | | | | relevant limit. |
| files_free | 22 | uint64 | READ | Free file slots |
| | | | | on the |
| | | | | filesystem |
| | | | | containing this |
| | | | | object - this |
| | | | | should be the |
| | | | | smallest |
| | | | | relevant limit. |
| files_total | 23 | uint64 | READ | Total file slots |
| | | | | on the |
| | | | | filesystem |
| | | | | containing this |
| | | | | object. |
| fs_locations | 24 | fs_locations | READ | Locations where |
| | | | | this filesystem |
| | | | | may be found. If |
| | | | | the server |
| | | | | returns |
| | | | | NFS4ERR_MOVED as |
| | | | | an error, this |
| | | | | attribute MUST |
| | | | | be supported. |
| hidden | 25 | bool | R/W | True, if the |
| | | | | file is |
| | | | | considered |
| | | | | hidden with |
| | | | | respect to the |
| | | | | Windows API. |
| homogeneous | 26 | bool | READ | True, if this |
| | | | | object's |
| | | | | filesystem is |
| | | | | homogeneous, |
| | | | | i.e., are per |
| | | | | filesystem |
| | | | | attributes the |
| | | | | same for all |
| | | | | filesystem's |
| | | | | objects? |
| maxfilesize | 27 | uint64 | READ | Maximum |
| | | | | supported file |
| | | | | size for the |
| | | | | filesystem of |
| | | | | this object. |
| maxlink | 28 | uint32 | READ | Maximum number |
| | | | | of links for |
| | | | | this object. |
| maxname | 29 | uint32 | READ | Maximum filename |
| | | | | size supported |
| | | | | for this object. |
| maxread | 30 | uint64 | READ | Maximum read |
| | | | | size supported |
| | | | | for this object. |
| maxwrite | 31 | uint64 | READ | Maximum write |
| | | | | size supported |
| | | | | for this object. |
| | | | | This attribute |
| | | | | SHOULD be |
| | | | | supported if the |
| | | | | file is |
| | | | | writable. Lack |
| | | | | of this |
| | | | | attribute can |
| | | | | lead to the |
| | | | | client either |
| | | | | wasting |
| | | | | bandwidth or not |
| | | | | receiving the |
| | | | | best |
| | | | | performance. |
| mimetype | 32 | utf8<> | R/W | MIME body |
| | | | | type/subtype of |
| | | | | this object. |
| mode | 33 | mode4 | R/W | UNIX-style mode |
| | | | | and permission |
| | | | | bits for this |
| | | | | object. |
| no_trunc | 34 | bool | READ | True, if a name |
| | | | | longer than |
| | | | | name_max is |
| | | | | used, an error |
| | | | | will be returned |
| | | | | and the name is |
| | | | | not truncated. |
| numlinks | 35 | uint32 | READ | Number of hard |
| | | | | links to this |
| | | | | object. |
| owner | 36 | utf8<> | R/W | The string name |
| | | | | of the owner of |
| | | | | this object. |
| owner_group | 37 | utf8<> | R/W | The string name |
| | | | | of the group |
| | | | | ownership of |
| | | | | this object. |
| quota_avail_hard | 38 | uint64 | READ | For definition |
| | | | | see Section 5.10 |
| | | | | "Quota |
| | | | | Attributes" |
| | | | | below. |
| quota_avail_soft | 39 | uint64 | READ | For definition |
| | | | | see Section 5.10 |
| | | | | "Quota |
| | | | | Attributes" |
| | | | | below. |
| quota_used | 40 | uint64 | READ | For definition |
| | | | | see Section 5.10 |
| | | | | "Quota |
| | | | | Attributes" |
| | | | | below. |
| rawdev | 41 | specdata4 | READ | Raw device |
| | | | | identifier. UNIX |
| | | | | device |
| | | | | major/minor node |
| | | | | information. If |
| | | | | the value of |
| | | | | type is not |
| | | | | NF4BLK or |
| | | | | NF4CHR, the |
| | | | | value returned |
| | | | | SHOULD NOT be |
| | | | | considered |
| | | | | useful. |
| space_avail | 42 | uint64 | READ | Disk space in |
| | | | | bytes available |
| | | | | to this user on |
| | | | | the filesystem |
| | | | | containing this |
| | | | | object - this |
| | | | | should be the |
| | | | | smallest |
| | | | | relevant limit. |
| space_free | 43 | uint64 | READ | Free disk space |
| | | | | in bytes on the |
| | | | | filesystem |
| | | | | containing this |
| | | | | object - this |
| | | | | should be the |
| | | | | smallest |
| | | | | relevant limit. |
| space_total | 44 | uint64 | READ | Total disk space |
| | | | | in bytes on the |
| | | | | filesystem |
| | | | | containing this |
| | | | | object. |
| space_used | 45 | uint64 | READ | Number of |
| | | | | filesystem bytes |
| | | | | allocated to |
| | | | | this object. |
| system | 46 | bool | R/W | True, if this |
| | | | | file is a |
| | | | | "system" file |
| | | | | with respect to |
| | | | | the Windows API. |
| time_access | 47 | nfstime4 | READ | The time of last |
| | | | | access to the |
| | | | | object by a read |
| | | | | that was |
| | | | | satisfied by the |
| | | | | server. |
| time_access_set | 48 | settime4 | WRITE | Set the time of |
| | | | | last access to |
| | | | | the object. |
| | | | | SETATTR use |
| | | | | only. |
| time_backup | 49 | nfstime4 | R/W | The time of last |
| | | | | backup of the |
| | | | | object. |
| time_create | 50 | nfstime4 | R/W | The time of |
| | | | | creation of the |
| | | | | object. This |
| | | | | attribute does |
| | | | | not have any |
| | | | | relation to the |
| | | | | traditional UNIX |
| | | | | file attribute |
| | | | | "ctime" or |
| | | | | "change time". |
| time_delta | 51 | nfstime4 | READ | Smallest useful |
| | | | | server time |
| | | | | granularity. |
| time_metadata | 52 | nfstime4 | READ | The time of last |
| | | | | meta-data |
| | | | | modification of |
| | | | | the object. |
| time_modify | 53 | nfstime4 | READ | The time of last |
| | | | | modification to |
| | | | | the object. |
| time_modify_set | 54 | settime4 | WRITE | Set the time of |
| | | | | last |
| | | | | modification to |
| | | | | the object. |
| | | | | SETATTR use |
| | | | | only. |
| mounted_on_fileid | 55 | uint64 | READ | Like fileid, but |
| | | | | if the target |
| | | | | filehandle is |
| | | | | the root of a |
| | | | | filesystem |
| | | | | return the |
| | | | | fileid of the |
| | | | | underlying |
| | | | | directory. |
+-------------------+----+--------------+--------+------------------+
Table 3
5.7. Time Access
As defined above, the time_access attribute represents the time of
last access to the object by a read that was satisfied by the server.
The notion of what is an "access" depends on the server's operating
environment and/or the server's filesystem semantics. For example,
for servers obeying POSIX semantics, time_access would be updated
only by the READLINK, READ, and READDIR operations and not any of the
operations that modify the content of the object. Of course, setting
the corresponding time_access_set attribute is another way to modify
the time_access attribute.
Whenever the file object resides on a writable filesystem, the server
should make best efforts to record time_access into stable storage.
However, to mitigate the performance effects of doing so, and most
especially whenever the server is satisfying the read of the object's
content from its cache, the server MAY cache access time updates and
lazily write them to stable storage. It is also acceptable to give
administrators of the server the option to disable time_access
updates.
5.8. Interpreting owner and owner_group
The recommended attributes "owner" and "owner_group" (and also users
and groups within the "acl" attribute) are represented in terms of a
UTF-8 string. To avoid a representation that is tied to a particular
underlying implementation at the client or server, the use of the
UTF-8 string has been chosen. Note that section 6.1 of [24] provides
additional rationale. It is expected that the client and server will
have their own local representation of owner and owner_group that is
used for local storage or presentation to the end user. Therefore,
it is expected that when these attributes are transferred between the
client and server, the local representation is translated to a
syntax of the form "user@dns_domain". This allows a client and
server that do not use the same local representation to translate to
a common syntax that can be interpreted by both.
Similarly, security principals may be represented in different ways
by different security mechanisms. Servers normally translate these
representations into a common format, generally that used by local
storage, to serve as a means of identifying the users corresponding
to these security principals. When these local identifiers are
translated to the form of the owner attribute, associated with files
created by such principals they identify, in a common format, the
users associated with each corresponding set of security principals.
The translation used to interpret owner and group strings is not
specified as part of the protocol. This allows various solutions to
be employed. For example, a local translation table may be consulted
that maps between a numeric id to the user@dns_domain syntax. A name
service may also be used to accomplish the translation. A server may
provide a more general service, not limited by any particular
translation (which would only translate a limited set of possible
strings) by storing the owner and owner_group attributes in local
storage without any translation or it may augment a translation
method by storing the entire string for attributes for which no
translation is available while using the local representation for
those cases in which a translation is available.
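As a purely illustrative example of the simplest such approach, a
POSIX-based server with a configured DNS domain might translate a
local uid into the owner string as follows; the use of getpwuid()
and of a single configured domain are assumptions of this sketch,
not protocol requirements.

   #include <sys/types.h>
   #include <pwd.h>
   #include <stdio.h>

   /* Illustrative only: translate a local uid to "user@dns_domain".
    * Returns 0 on success, -1 if no translation is available. */
   static int owner_string(uid_t uid, const char *dns_domain,
                           char *buf, size_t buflen)
   {
       struct passwd *pw = getpwuid(uid);

       if (pw == NULL)
           return -1;                  /* no translation available */
       snprintf(buf, buflen, "%s@%s", pw->pw_name, dns_domain);
       return 0;
   }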
Servers that do not provide support for all possible values of the
owner and owner_group attributes, should return an error
(NFS4ERR_BADOWNER) when a string is presented that has no
translation, as the value to be set for a SETATTR of the owner,
owner_group, or acl attributes. When a server does accept an owner
or owner_group value as valid on a SETATTR (and similarly for the
owner and group strings in an acl), it is promising to return that
same string when a corresponding GETATTR is done. Configuration
changes and ill-constructed name translations (those that contain
aliasing) may make that promise impossible to honor. Servers should
make appropriate efforts to avoid a situation in which these
attributes have their values changed when no real change to ownership
has occurred.
The "dns_domain" portion of the owner string is meant to be a DNS
domain name. For example, user@ietf.org. Servers should accept as
valid a set of users for at least one domain. A server may treat
other domains as having no valid translations. A more general
service is provided when a server is capable of accepting users for
multiple domains, or for all domains, subject to security
constraints.
In the case where there is no translation available to the client or
server, the attribute value must be constructed without the "@".
Therefore, the absence of the @ from the owner or owner_group
attribute signifies that no translation was available at the sender
and that the receiver of the attribute should not use that string as
a basis for translation into its own internal format. Even though
the attribute value can not be translated, it may still be useful.
In the case of a client, the attribute string may be used for local
display of ownership.
To provide a greater degree of compatibility with previous versions
of NFS (i.e., v2 and v3), which identified users and groups by 32-bit
unsigned uid's and gid's, owner and group strings that consist of
decimal numeric values with no leading zeros can be given a special
interpretation by clients and servers which choose to provide such
support. The receiver may treat such a user or group string as
representing the same user as would be represented by a v2/v3 uid or
gid having the corresponding numeric value. A server is not
obligated to accept such a string, but may return an NFS4ERR_BADOWNER
instead. To avoid this mechanism being used to subvert user and
group translation, so that a client might pass all of the owners and
groups in numeric form, a server SHOULD return an NFS4ERR_BADOWNER
error when there is a valid translation for the user or group
designated in this way. In that case, the client must use the
appropriate name@domain string and not the special form for
compatibility.
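A receiver that chooses to support the numeric compatibility form
might classify incoming owner strings along the following lines.
This C fragment is illustrative only; the enum names and the choice
to distinguish the remaining cases by the presence or absence of "@"
are assumptions of the sketch.

   #include <ctype.h>
   #include <string.h>

   enum owner_form {
       OWNER_NUMERIC,         /* decimal, no leading zeros: v2/v3 id  */
       OWNER_NAME_AT_DOMAIN,  /* "user@dns_domain" form               */
       OWNER_UNTRANSLATED     /* no "@": sender had no translation    */
   };

   static enum owner_form classify_owner(const char *s)
   {
       size_t i, len = strlen(s);

       /* Decimal numeral with no leading zeros. */
       if (len > 0 && !(s[0] == '0' && len > 1)) {
           for (i = 0; i < len; i++)
               if (!isdigit((unsigned char)s[i]))
                   break;
           if (i == len)
               return OWNER_NUMERIC;
       }
       return strchr(s, '@') != NULL ? OWNER_NAME_AT_DOMAIN
                                     : OWNER_UNTRANSLATED;
   }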
The owner string "nobody" may be used to designate an anonymous user,
which will be associated with a file created by a security principal
that cannot be mapped through normal means to the owner attribute.
5.9. Character Case Attributes
With respect to the case_insensitive and case_preserving attributes,
each UCS-4 character (which UTF-8 encodes) has a "long descriptive
name" [25] which may or may not included the word "CAPITAL" or
"SMALL". The presence of SMALL or CAPITAL allows an NFS server to
implement unambiguous and efficient table driven mappings for case
insensitive comparisons, and non-case-preserving storage. For
general character handling and internationalization issues, see
Section 1 "Internationalization".
5.10. Quota Attributes
For the attributes related to filesystem quotas, the following
definitions apply:
quota_avail_soft The value in bytes which represents the amount of
additional disk space that can be allocated to this file or
directory before the user may reasonably be warned. It is
understood that this space may be consumed by allocations to other
files or directories though there is a rule as to which other
files or directories.
quota_avail_hard The value in bytes which represents the amount of
additional disk space beyond the current allocation that can be
allocated to this file or directory before further allocations
will be refused. It is understood that this space may be consumed
by allocations to other files or directories.
quota_used The value in bytes which represents the amount of disk
space used by this file or directory and possibly a number of
other similar files or directories, where the set of "similar"
meets at least the criterion that allocating space to any file or
directory in the set will reduce the "quota_avail_hard" of every
other file or directory in the set.
Note that there may be a number of distinct but overlapping sets
of files or directories for which a quota_used value is maintained
(e.g., "all files with a given owner", "all files with a given
group owner", etc.).
The server is at liberty to choose any of those sets but should do
so in a repeatable way. The rule may be configured per-filesystem
or may be "choose the set with the smallest quota".
5.11. Access Control Lists
The NFS version 4 ACL attribute is an array of access control entries
(ACE). Although the client can read and write the ACL attribute,
the NFSv4 model is that the server does all access control based on
the
server's interpretation of the ACL. If at any point the client wants
to check access without issuing an operation that modifies or reads
data or metadata, the client can use the OPEN and ACCESS operations
to do so. There are various access control entry types, as defined
in the Section "ACE type". The server is able to communicate which
ACE types are supported by returning the appropriate value within the
aclsupport attribute. Each ACE covers one or more operations on a
file or directory as described in the Section "ACE Access Mask". It
may also contain one or more flags that modify the semantics of the
ACE as defined in the Section "ACE flag".
The NFS ACE attribute is defined as follows:
typedef uint32_t acetype4;
typedef uint32_t aceflag4;
typedef uint32_t acemask4;
struct nfsace4 {
acetype4 type;
aceflag4 flag;
acemask4 access_mask;
utf8str_mixed who;
};
To determine if a request succeeds, each nfsace4 entry is processed
in order by the server. Only ACEs which have a "who" that matches
the requester are considered. Each ACE is processed until all of the
bits of the requester's access have been ALLOWED. Once a bit (see
below) has been ALLOWED by an ACCESS_ALLOWED_ACE, it is no longer
considered in the processing of later ACEs. If an ACCESS_DENIED_ACE
is encountered where the requester's access still has unALLOWED bits
in common with the "access_mask" of the ACE, the request is denied.
However, unlike the ALLOWED and DENIED ACE types, the ALARM and AUDIT
ACE types do not affect a requester's access, and instead are for
triggering events as a result of a requester's access attempt.
Therefore, all AUDIT and ALARM ACEs are processed until end of the
ACL. When the ACL is fully processed, if there are bits in the
requester's mask that have not been considered, whether the server
allows or denies the access is undefined.
attribute on the file, then this cannot happen, since the mode's
MODE4_*OTH bits will map to EVERYONE@ ACEs that unambiguously specify
the requester's access.
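The following C fragment is a simplified, non-normative sketch of
the evaluation just described. It covers only the ALLOW and DENY ACE
types, uses a stand-in flag in place of real "who" matching, and
simply denies access when the ACL is exhausted with bits still
undecided (a case the protocol leaves undefined).

   #include <stdint.h>

   #define ACE4_ACCESS_ALLOWED_ACE_TYPE 0x00000000
   #define ACE4_ACCESS_DENIED_ACE_TYPE  0x00000001

   struct ace {
       uint32_t type;
       uint32_t access_mask;
       int      who_matches_requester;   /* stand-in for real matching */
   };

   /* Returns 1 if 'requested' access is allowed, 0 otherwise. */
   static int acl_check(const struct ace *acl, int nace,
                        uint32_t requested)
   {
       uint32_t remaining = requested;   /* bits not yet ALLOWED */
       int i;

       for (i = 0; i < nace && remaining != 0; i++) {
           const struct ace *a = &acl[i];

           if (!a->who_matches_requester)
               continue;
           if (a->type == ACE4_ACCESS_ALLOWED_ACE_TYPE)
               remaining &= ~a->access_mask;  /* bits now ALLOWED   */
           else if (a->type == ACE4_ACCESS_DENIED_ACE_TYPE &&
                    (remaining & a->access_mask) != 0)
               return 0;                      /* request is denied  */
       }
       return remaining == 0;
   }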
The NFS version 4 ACL model is quite rich. Some server platforms may
provide access control functionality that goes beyond the UNIX-style
mode attribute, but which is not as rich as the NFS ACL model. So
that users can take advantage of this more limited functionality, the
server may indicate that it supports ACLs as long as it follows the
guidelines for mapping between its ACL model and the NFS version 4
ACL model.
The situation is complicated by the fact that a server may have
multiple modules that enforce ACLs. For example, the enforcement for
NFS version 4 access may be different from the enforcement for local
access, and both may be different from the enforcement for access
through other protocols such as SMB. So it may be useful for a
server to accept an ACL even if not all of its modules are able to
support it.
The guiding principle in all cases is that the server must not accept
ACLs that appear to make the file more secure than it really is.
5.11.1. ACE type
+-------+-----------------------------------------------------------+
| Type | Description |
+-------+-----------------------------------------------------------+
| ALLOW | Explicitly grants the access defined in acemask4 to the |
| | file or directory. |
| DENY | Explicitly denies the access defined in acemask4 to the |
| | file or directory. |
| AUDIT | LOG (system dependent) any access attempt to a file or |
| | directory which uses any of the access methods specified |
| | in acemask4. |
| ALARM | Generate a system ALARM (system dependent) when any |
| | access attempt is made to a file or directory for the |
| | access methods specified in acemask4. |
+-------+-----------------------------------------------------------+
Table 4
A server need not support all of the above ACE types. The bitmask
constants used to represent the above definitions within the
aclsupport attribute are as follows:
const ACL4_SUPPORT_ALLOW_ACL = 0x00000001;
const ACL4_SUPPORT_DENY_ACL = 0x00000002;
const ACL4_SUPPORT_AUDIT_ACL = 0x00000004;
const ACL4_SUPPORT_ALARM_ACL = 0x00000008;
The semantics of the "type" field follow the descriptions provided
above.
The constants used for the type field (acetype4) are as follows:
const ACE4_ACCESS_ALLOWED_ACE_TYPE = 0x00000000;
const ACE4_ACCESS_DENIED_ACE_TYPE = 0x00000001;
const ACE4_SYSTEM_AUDIT_ACE_TYPE = 0x00000002;
const ACE4_SYSTEM_ALARM_ACE_TYPE = 0x00000003;
Clients should not attempt to set an ACE unless the server claims
support for that ACE type. If the server receives a request to set
an ACE that it cannot store, it MUST reject the request with
NFS4ERR_ATTRNOTSUPP. If the server receives a request to set an ACE
that it can store but cannot enforce, the server SHOULD reject the
request with NFS4ERR_ATTRNOTSUPP.
Example: suppose a server can enforce NFS ACLs for NFS access but
cannot enforce ACLs for local access. If arbitrary processes can run
on the server, then the server SHOULD NOT indicate ACL support. On
the other hand, if only trusted administrative programs run locally,
then the server may indicate ACL support.
5.11.2. ACE Access Mask
The access_mask field contains values based on the following:
+-------------------+-----------------------------------------------+
| Access | Description |
+-------------------+-----------------------------------------------+
| READ_DATA | Permission to read the data of the file |
| LIST_DIRECTORY | Permission to list the contents of a |
| | directory |
| WRITE_DATA | Permission to modify the file's data |
| ADD_FILE | Permission to add a new file to a directory |
| APPEND_DATA | Permission to append data to a file |
| ADD_SUBDIRECTORY | Permission to create a subdirectory to a |
| | directory |
| READ_NAMED_ATTRS | Permission to read the named attributes of a |
| | file |
| WRITE_NAMED_ATTRS | Permission to write the named attributes of a |
| | file |
| EXECUTE | Permission to execute a file |
| DELETE_CHILD | Permission to delete a file or directory |
| | within a directory |
| READ_ATTRIBUTES | The ability to read basic attributes |
| | (non-acls) of a file |
| WRITE_ATTRIBUTES | Permission to change basic attributes |
| | (non-acls) of a file |
| DELETE | Permission to Delete the file |
| READ_ACL | Permission to Read the ACL |
| WRITE_ACL | Permission to Write the ACL |
| WRITE_OWNER | Permission to change the owner |
| SYNCHRONIZE | Permission to access file locally at the |
| | server with synchronous reads and writes |
+-------------------+-----------------------------------------------+
Table 5
The bitmask constants used for the access mask field are as follows:
const ACE4_READ_DATA = 0x00000001;
const ACE4_LIST_DIRECTORY = 0x00000001;
const ACE4_WRITE_DATA = 0x00000002;
const ACE4_ADD_FILE = 0x00000002;
const ACE4_APPEND_DATA = 0x00000004;
const ACE4_ADD_SUBDIRECTORY = 0x00000004;
const ACE4_READ_NAMED_ATTRS = 0x00000008;
const ACE4_WRITE_NAMED_ATTRS = 0x00000010;
const ACE4_EXECUTE = 0x00000020;
const ACE4_DELETE_CHILD = 0x00000040;
const ACE4_READ_ATTRIBUTES = 0x00000080;
const ACE4_WRITE_ATTRIBUTES = 0x00000100;
const ACE4_DELETE = 0x00010000;
const ACE4_READ_ACL = 0x00020000;
const ACE4_WRITE_ACL = 0x00040000;
const ACE4_WRITE_OWNER = 0x00080000;
const ACE4_SYNCHRONIZE = 0x00100000;
Server implementations need not provide the granularity of control
that is implied by this list of masks. For example, POSIX-based
systems might not distinguish APPEND_DATA (the ability to append to a
file) from WRITE_DATA (the ability to modify existing contents); both
masks would be tied to a single "write" permission. When such a
server returns attributes to the client, it would show both
APPEND_DATA and WRITE_DATA if and only if the write permission is
enabled.
If a server receives a SETATTR request that it cannot accurately
implement, it should error in the direction of more restricted
access. For example, suppose a server cannot distinguish overwriting
data from appending new data, as described in the previous paragraph.
If a client submits an ACE where APPEND_DATA is set but WRITE_DATA is
not (or vice versa), the server should reject the request with
NFS4ERR_ATTRNOTSUPP. Nonetheless, if the ACE has type DENY, the
server may silently turn on the other bit, so that both APPEND_DATA
and WRITE_DATA are denied.
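A non-normative sketch of both behaviors for such a POSIX-style
server follows; the function names and the return convention are
assumptions of the example.

   #include <stdint.h>

   #define ACE4_WRITE_DATA              0x00000002
   #define ACE4_APPEND_DATA             0x00000004
   #define ACE4_ACCESS_ALLOWED_ACE_TYPE 0x00000000
   #define ACE4_ACCESS_DENIED_ACE_TYPE  0x00000001

   /* Returning attributes: show both bits iff "write" is enabled. */
   static uint32_t report_write_bits(int posix_write_enabled)
   {
       return posix_write_enabled ? (ACE4_WRITE_DATA | ACE4_APPEND_DATA)
                                  : 0;
   }

   /* Accepting a SETATTR of an ACE: err toward restriction.
    * Returns 0 to accept, -1 to reject with NFS4ERR_ATTRNOTSUPP. */
   static int accept_write_bits(uint32_t type, uint32_t *mask)
   {
       uint32_t both = ACE4_WRITE_DATA | ACE4_APPEND_DATA;
       uint32_t have = *mask & both;

       if (have == 0 || have == both)
           return 0;                    /* unambiguous; storable     */
       if (type == ACE4_ACCESS_DENIED_ACE_TYPE) {
           *mask |= both;               /* silently deny both bits   */
           return 0;
       }
       return -1;                       /* ALLOW with only one bit   */
   }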
5.11.3. ACE flag
The "flag" field contains values based on the following descriptions.
ACE4_FILE_INHERIT_ACE Can be placed on a directory and indicates
that this ACE should be added to each new non-directory file
created.
ACE4_DIRECTORY_INHERIT_ACE Can be placed on a directory and
indicates that this ACE should be added to each new directory
created.
ACE4_INHERIT_ONLY_ACE Can be placed on a directory but does not
apply to the directory, only to newly created files/directories as
specified by the above two flags.
ACE4_NO_PROPAGATE_INHERIT_ACE Can be placed on a directory.
Normally when a new directory is created and an ACE exists on the
parent directory which is marked ACE4_DIRECTORY_INHERIT_ACE, two
ACEs are placed on the new directory. One for the directory
itself and one which is an inheritable ACE for newly created
directories. This flag tells the server to not place an ACE on
the newly created directory which is inheritable by subdirectories
of the created directory.
ACE4_SUCCESSFUL_ACCESS_ACE_FLAG
ACE4_FAILED_ACCESS_ACE_FLAG The ACE4_SUCCESSFUL_ACCESS_ACE_FLAG
(SUCCESS) and ACE4_FAILED_ACCESS_ACE_FLAG (FAILED) flag bits
relate only to ACE4_SYSTEM_AUDIT_ACE_TYPE (AUDIT) and
ACE4_SYSTEM_ALARM_ACE_TYPE (ALARM) ACE types. If during the
processing of the file's ACL, the server encounters an AUDIT or
ALARM ACE that matches the principal attempting the OPEN, the
server notes that fact, and the presence, if any, of the SUCCESS
and FAILED flags encountered in the AUDIT or ALARM ACE. Once the
server completes the ACL processing, and the share reservation
processing, and the OPEN call, it then notes if the OPEN succeeded
or failed. If the OPEN succeeded, and if the SUCCESS flag was set
for a matching AUDIT or ALARM, then the appropriate AUDIT or ALARM
event occurs. If the OPEN failed, and if the FAILED flag was set
for the matching AUDIT or ALARM, then the appropriate AUDIT or
ALARM event occurs. Clearly either or both of the SUCCESS or
FAILED can be set, but if neither is set, the AUDIT or ALARM ACE
is not useful.
The previously described processing applies to the ACCESS
operation as well. The difference is that "success" or
"failure" does not mean whether ACCESS returns NFS4_OK or not.
Success means that ACCESS returns all requested and supported
bits. Failure means that ACCESS fails to return a bit that
was requested and supported.
ACE4_IDENTIFIER_GROUP Indicates that the "who" refers to a GROUP as
defined under UNIX.
The bitmask constants used for the flag field are as follows:
const ACE4_FILE_INHERIT_ACE = 0x00000001;
const ACE4_DIRECTORY_INHERIT_ACE = 0x00000002;
const ACE4_NO_PROPAGATE_INHERIT_ACE = 0x00000004;
const ACE4_INHERIT_ONLY_ACE = 0x00000008;
const ACE4_SUCCESSFUL_ACCESS_ACE_FLAG = 0x00000010;
const ACE4_FAILED_ACCESS_ACE_FLAG = 0x00000020;
const ACE4_IDENTIFIER_GROUP = 0x00000040;
A server need not support any of these flags. If the server supports
flags that are similar to, but not exactly the same as, these flags,
the implementation may define a mapping between the protocol-defined
flags and the implementation-defined flags. Again, the guiding
principle is that the file not appear to be more secure than it
really is.
For example, suppose a client tries to set an ACE with
ACE4_FILE_INHERIT_ACE set but not ACE4_DIRECTORY_INHERIT_ACE. If the
server does not support any form of ACL inheritance, the server
should reject the request with NFS4ERR_ATTRNOTSUPP. If the server
supports a single "inherit ACE" flag that applies to both files and
directories, the server may reject the request (i.e., requiring the
client to set both the file and directory inheritance flags). The
server may also accept the request and silently turn on the
ACE4_DIRECTORY_INHERIT_ACE flag.
5.11.4. ACE who
There are several special identifiers ("who") which need to be
understood universally, rather than in the context of a particular
DNS domain. Some of these identifiers cannot be understood when an
NFS client accesses the server, but have meaning when a local process
accesses the file. The ability to display and modify these
permissions is permitted over NFS, even if none of the access methods
on the server understands the identifiers.
+-----------------+------------------------------------------------+
| Who | Description |
+-----------------+------------------------------------------------+
| "OWNER" | The owner of the file. |
| "GROUP" | The group associated with the file. |
| "EVERYONE" | The world. |
| "INTERACTIVE" | Accessed from an interactive terminal. |
| "NETWORK" | Accessed via the network. |
| "DIALUP" | Accessed as a dialup user to the server. |
| "BATCH" | Accessed from a batch job. |
| "ANONYMOUS" | Accessed without any authentication. |
| "AUTHENTICATED" | Any authenticated user (opposite of ANONYMOUS) |
| "SERVICE" | Access from a system service. |
+-----------------+------------------------------------------------+
Table 6
To avoid conflict, these special identifiers are distinguished by an
appended "@" and should appear in the form "xxxx@" (note: no domain
name after the "@"). For example: ANONYMOUS@.
5.11.5. Mode Attribute
The NFS version 4 mode attribute is based on the UNIX mode bits. The
following bits are defined:
const MODE4_SUID = 0x800; /* set user id on execution */
const MODE4_SGID = 0x400; /* set group id on execution */
const MODE4_SVTX = 0x200; /* save text even after use */
const MODE4_RUSR = 0x100; /* read permission: owner */
const MODE4_WUSR = 0x080; /* write permission: owner */
const MODE4_XUSR = 0x040; /* execute permission: owner */
const MODE4_RGRP = 0x020; /* read permission: group */
const MODE4_WGRP = 0x010; /* write permission: group */
const MODE4_XGRP = 0x008; /* execute permission: group */
const MODE4_ROTH = 0x004; /* read permission: other */
const MODE4_WOTH = 0x002; /* write permission: other */
const MODE4_XOTH = 0x001; /* execute permission: other */
Bits MODE4_RUSR, MODE4_WUSR, and MODE4_XUSR apply to the principal
identified in the owner attribute. Bits MODE4_RGRP, MODE4_WGRP, and
MODE4_XGRP apply to the principals identified in the owner_group
attribute. Bits MODE4_ROTH, MODE4_WOTH, MODE4_XOTH apply to any
principal that does not match that in the owner attribute, and does
not
have a group matching that of the owner_group attribute.
The remaining bits are not defined by this protocol and MUST NOT be
used. The minor version mechanism must be used to define further bit
usage.
Note that in UNIX, if a file has the MODE4_SGID bit set and no
MODE4_XGRP bit set, then READ and WRITE must use mandatory file
locking.
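As a non-normative example, the familiar UNIX mode 0755 (read/write/
execute for the owner, read/execute for group and other) is composed
from these bits as follows; the C fragment assumes the MODE4_*
constants listed above are available (e.g., from an XDR-generated
header).
#include <stdint.h>
/* 0755: read/write/execute for the owner, read/execute for the
 * principals in owner_group, and read/execute for everyone else. */
static const uint32_t example_mode =
    MODE4_RUSR | MODE4_WUSR | MODE4_XUSR |   /* 0x100|0x080|0x040 */
    MODE4_RGRP | MODE4_XGRP |                /* 0x020|0x008       */
    MODE4_ROTH | MODE4_XOTH;                 /* 0x004|0x001       */
/* Does the "other" class have read permission? */
static int other_can_read(uint32_t mode)
{
    return (mode & MODE4_ROTH) != 0;
}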
5.11.6. Mode and ACL Attribute
A server that supports both mode and ACL must take care to
synchronize the MODE4_*USR, MODE4_*GRP, and MODE4_*OTH bits with the
ACEs which have respective who fields of "OWNER@", "GROUP@", and
"EVERYONE@" so that the client can see semantically equivalent access
permissions exist whether the client asks for owner, owner_group and
mode attributes, or for just the ACL.
Because the mode attribute includes bits (e.g., MODE4_SVTX) that have
nothing to do with ACL semantics, it is permitted for clients to
specify both the ACL attribute and mode in the same SETATTR
operation. However, because there is no prescribed order for
processing the attributes in a SETATTR, the client must ensure that
the ACL attribute, if specified without mode, would produce the
desired mode bits, and conversely, that the mode attribute, if
specified without
ACL, would produce the desired "OWNER@", "GROUP@", and "EVERYONE@"
ACEs.
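A non-normative sketch of one direction of this synchronization,
deriving an access mask for an "OWNER@" ALLOW ACE from the
MODE4_*USR bits, is shown below. It assumes the ACE4_READ_DATA,
ACE4_WRITE_DATA, and ACE4_EXECUTE access mask bits defined earlier in
this document; a real server must also consider the remaining mask
bits, DENY ACEs, and the "GROUP@" and "EVERYONE@" cases.
#include <stdint.h>
/* Sketch: map the owner portion of the mode attribute to an access
 * mask suitable for an "OWNER@" ALLOW ACE.  Incomplete by design. */
uint32_t owner_ace_mask_from_mode(uint32_t mode)
{
    uint32_t mask = 0;
    if (mode & MODE4_RUSR) mask |= ACE4_READ_DATA;
    if (mode & MODE4_WUSR) mask |= ACE4_WRITE_DATA;
    if (mode & MODE4_XUSR) mask |= ACE4_EXECUTE;
    return mask;
}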
5.11.7. mounted_on_fileid
UNIX-based operating environments connect a filesystem into the
namespace by connecting (mounting) the filesystem onto the existing
file object (the mount point, usually a directory) of an existing
filesystem. When the mount point's parent directory is read via an
API like readdir(), the return results are directory entries, each
with a component name and a fileid. The fileid of the mount point's
directory entry will be different from the fileid that the stat()
system call returns. The stat() system call is returning the fileid
of the root of the mounted filesystem, whereas readdir() is returning
the fileid stat() would have returned before any filesystems were
mounted on the mount point.
Unlike NFS version 3, NFS version 4 allows a client's LOOKUP request
to cross other filesystems. The client detects the filesystem
crossing whenever the filehandle argument of LOOKUP has an fsid
attribute different from that of the filehandle returned by LOOKUP.
A UNIX-based client will consider this a "mount point crossing".
UNIX has a legacy scheme for allowing a process to determine its
current working directory. This relies on readdir() of a mount
point's parent and stat() of the mount point returning fileids as
previously described. The mounted_on_fileid attribute corresponds to
the fileid that readdir() would have returned as described
previously.
While the NFS version 4 client could simply fabricate a fileid
corresponding to what mounted_on_fileid provides (and if the server
does not support mounted_on_fileid, the client has no choice), there
is a risk that the client will generate a fileid that conflicts with
one that is already assigned to another object in the filesystem.
Instead, if the server can provide the mounted_on_fileid, the
potential for client operational problems in this area is eliminated.
If the server detects that there is no filesystem mounted at the target
file object, then the value for mounted_on_fileid that it returns is
the same as that of the fileid attribute.
The mounted_on_fileid attribute is RECOMMENDED, so the server SHOULD
provide it if possible, and for a UNIX-based server, this is
straightforward. Usually, mounted_on_fileid will be requested during
a READDIR operation, in which case it is trivial (at least for UNIX-
based servers) to return mounted_on_fileid since it is equal to the
fileid of a directory entry returned by readdir(). If
mounted_on_fileid is requested in a GETATTR operation, the server
should obey an invariant that has it returning a value that is equal
to the fileid of the file object's entry in the object's parent
directory, i.e.,
what readdir() would have returned. Some operating environments
allow a series of two or more filesystems to be mounted onto a single
mount point. In this case, for the server to obey the aforementioned
invariant, it will need to find the base mount point, and not the
intermediate mount points.
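The invariant described above can be summarized by the following
non-normative sketch, in which the caller supplies the
server-internal knowledge: whether the object is the root of a
mounted filesystem and, if so, the fileid of the covered directory
entry in the parent (the base mount point when filesystems are
stacked).
#include <stdint.h>
typedef uint64_t fileid4;
/* Sketch: value of mounted_on_fileid for a file object. */
fileid4 mounted_on_fileid(fileid4 fileid_attr,
                          int is_root_of_mounted_fs,
                          fileid4 covered_entry_fileid)
{
    if (is_root_of_mounted_fs) {
        /* What readdir() of the parent would have returned before
         * anything was mounted here. */
        return covered_entry_fileid;
    }
    /* Nothing mounted at this object: same as the fileid attribute. */
    return fileid_attr;
}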
6. Filesystem Migration and Replication
With the use of the recommended attribute "fs_locations", the NFS
version 4 server has a method of providing filesystem migration or
replication services. For the purposes of migration and replication,
a filesystem will be defined as all files that share a given fsid
(both major and minor values are the same).
The fs_locations attribute provides a list of filesystem locations.
These locations are specified by providing the server name (either
DNS domain or IP address) and the path name representing the root of
the filesystem. Depending on the type of service being provided, the
list will provide a new location or a set of alternate locations for
the filesystem. The client will use this information to redirect its
requests to the new server.
6.1. Replication
It is expected that filesystem replication will be used in the case
of read-only data. Typically, the filesystem will be replicated on
two or more servers. The fs_locations attribute will provide the
list of these locations to the client. On first access of the
filesystem, the client should obtain the value of the fs_locations
attribute. If, in the future, the client finds the server
unresponsive, the client may attempt to use another server specified
by fs_locations.
If applicable, the client must take the appropriate steps to recover
valid filehandles from the new server. This is described in more
detail in the following sections.
6.2. Migration
Filesystem migration is used to move a filesystem from one server to
another. Migration is typically used for a filesystem that is
writable and has a single copy. The expected use of migration is for
load balancing or general resource reallocation. The protocol does
not specify how the filesystem will be moved between servers. This
server-to-server transfer mechanism is left to the server
implementor. However, the method used to communicate the migration
event between client and server is specified here.
Once the servers participating in the migration have completed the
move of the filesystem, the error NFS4ERR_MOVED will be returned for
subsequent requests received by the original server. The
NFS4ERR_MOVED error is returned for all operations except PUTFH and
GETATTR. Upon receiving the NFS4ERR_MOVED error, the client will
obtain the value of the fs_locations attribute. The client will then
use the contents of the attribute to redirect its requests to the
specified server. To facilitate the use of GETATTR, operations such
as PUTFH must also be accepted by the server for the migrated file
system's filehandles. Note that if the server returns NFS4ERR_MOVED,
the server MUST support the fs_locations attribute.
If the client requests more attributes than just fs_locations, the
server may return fs_locations only. This is to be expected since
the server has migrated the filesystem and may not have a method of
obtaining additional attribute data.
The server implementor needs to be careful in developing a migration
solution. The server must consider all of the state information
clients may have outstanding at the server. This includes but is not
limited to locking/share state, delegation state, and asynchronous
file writes which are represented by WRITE and COMMIT verifiers. The
server should strive to minimize the impact on its clients during and
after the migration process.
6.3. Interpretation of the fs_locations Attribute
The fs_locations attribute is structured in the following way:
struct fs_location4 {
utf8str_cis server<>;
pathname4 rootpath;
};
struct fs_locations4 {
pathname4 fs_root;
fs_location4 locations<>;
};
The fs_location struct is used to represent the location of a
filesystem by providing a server name and the path to the root of the
filesystem. For a multi-homed server or a set of servers that use
the same rootpath, an array of server names may be provided. An
entry in the server array is a UTF-8 string and represents one of: a
traditional DNS host name, an IPv4 address, or an IPv6 address. It is
not
a requirement that all servers that share the same rootpath be listed
in one fs_location struct. The array of server names is provided for
convenience. Servers that share the same rootpath may also be listed
in separate fs_location entries in the fs_locations attribute.
The fs_locations struct and attribute then contains an array of
locations. Since the name space of each server may be constructed
differently, the "fs_root" field is provided. The path represented
by fs_root represents the location of the filesystem in the server's
name space. Therefore, the fs_root path is only associated with the
server from which the fs_locations attribute was obtained. The
fs_root path is meant to aid the client in locating the filesystem at
the various servers listed.
As an example, there is a replicated filesystem located at two
servers (servA and servB). At servA the filesystem is located at
path "/a/b/c". At servB the filesystem is located at path "/x/y/z".
In this example the client accesses the filesystem first at servA
with a multi-component lookup path of "/a/b/c/d". Since the client
used a multi-component lookup to obtain the filehandle at "/a/b/c/d",
it is unaware that the filesystem's root is located in servA's name
space at "/a/b/c". When the client switches to servB, it will need
to determine that the directory it first referenced at servA is now
represented by the path "/x/y/z/d" on servB. To facilitate this, the
fs_locations attribute provided by servA would have a fs_root value
of "/a/b/c" and two entries in fs_location. One entry in fs_location
will be for itself (servA) and the other will be for servB with a
path of "/x/y/z". With this information, the client is able to
substitute "/x/y/z" for the "/a/b/c" at the beginning of its access
path and construct "/x/y/z/d" to use for the new server.
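The prefix substitution performed by the client in this example might
be sketched as follows. This is purely illustrative; an actual client
works with pathname4 component arrays rather than flat strings.
#include <stdio.h>
#include <string.h>
/* Sketch: replace the fs_root prefix learned from the old server with
 * the rootpath of the chosen fs_location entry for the new server. */
void substitute_root(const char *path, const char *fs_root,
                     const char *new_root, char *out, size_t outlen)
{
    size_t rootlen = strlen(fs_root);
    if (strncmp(path, fs_root, rootlen) == 0)
        snprintf(out, outlen, "%s%s", new_root, path + rootlen);
    else
        snprintf(out, outlen, "%s", path);   /* not under fs_root */
}
/* substitute_root("/a/b/c/d", "/a/b/c", "/x/y/z", buf, sizeof(buf))
 * yields "/x/y/z/d", the path to present to servB. */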
See Section 16 "Security Considerations" for a discussion on the
recommendations for the security flavor to be used by any GETATTR
operation that requests the "fs_locations" attribute.
6.4. Filehandle Recovery for Migration or Replication
Filehandles for filesystems that are replicated or migrated generally
have the same semantics as for filesystems that are not replicated or
migrated. For example, if a filesystem has persistent filehandles
and it is migrated to another server, the filehandle values for the
filesystem will be valid at the new server.
For volatile filehandles, the servers involved likely do not have a
mechanism to transfer filehandle format and content between
themselves. Therefore, a server may have difficulty in determining
if a volatile filehandle from an old server should return an error of
NFS4ERR_FHEXPIRED. For this reason, the client is informed, with the use
of the fh_expire_type attribute, whether volatile filehandles will
expire at the migration or replication event. If the bit
FH4_VOL_MIGRATION is set in the fh_expire_type attribute, the client
must treat the volatile filehandle as if the server had returned the
NFS4ERR_FHEXPIRED error. At the migration or replication event in
the presence of the FH4_VOL_MIGRATION bit, the client will not
present the original or old volatile filehandle to the new server.
The client will start its communication with the new server by
recovering its filehandles using the saved file names.
7. NFS Server Name Space
7.1. Server Exports
On a UNIX server the name space describes all the files reachable by
pathnames under the root directory or "/". On a Windows NT server
the name space constitutes all the files on disks named by mapped
disk letters. NFS server administrators rarely make the entire
server's filesystem name space available to NFS clients. More often
portions of the name space are made available via an "export"
feature. In previous versions of the NFS protocol, the root
filehandle for each export is obtained through the MOUNT protocol;
the client sends a string that identifies the export of name space
and the server returns the root filehandle for it. The MOUNT
protocol supports an EXPORTS procedure that will enumerate the
server's exports.
7.2. Browsing Exports
The NFS version 4 protocol provides a root filehandle that clients
can use to obtain filehandles for these exports via a multi-component
LOOKUP. A common user experience is to use a graphical user
interface (perhaps a file "Open" dialog window) to find a file via
progressive browsing through a directory tree. The client must be
able to move from one export to another export via single-component,
progressive LOOKUP operations.
This style of browsing is not well supported by the NFS version 2 and
3 protocols. The client expects all LOOKUP operations to remain
within a single server filesystem. For example, the device attribute
will not change. This prevents a client from taking name space paths
that span exports.
An automounter on the client can obtain a snapshot of the server's
name space using the EXPORTS procedure of the MOUNT protocol. If it
understands the server's pathname syntax, it can create an image of
the server's name space on the client. The parts of the name space
that are not exported by the server are filled in with a "pseudo
filesystem" that allows the user to browse from one mounted
filesystem to another. There is a drawback to this representation of
the server's name space on the client: it is static. If the server
administrator adds a new export the client will be unaware of it.
7.3. Server Pseudo Filesystem
NFS version 4 servers avoid this name space inconsistency by
presenting all the exports within the framework of a single server
name space. An NFS version 4 client uses LOOKUP and READDIR
operations to browse seamlessly from one export to another. Portions
of the server name space that are not exported are bridged via a
"pseudo filesystem" that provides a view of exported directories
only. A pseudo filesystem has a unique fsid and behaves like a
normal, read only filesystem.
Based on the construction of the server's name space, it is possible
that multiple pseudo filesystems may exist. For example,
/a pseudo filesystem
/a/b real filesystem
/a/b/c pseudo filesystem
/a/b/c/d real filesystem
Each of the pseudo filesystems is considered a separate entity and
therefore has a unique fsid.
7.4. Multiple Roots
The DOS and Windows operating environments are sometimes described as
having "multiple roots". Filesystems are commonly represented as
disk letters. MacOS represents filesystems as top level names. NFS
version 4 servers for these platforms can construct a pseudo file
system above these root names so that disk letters or volume names
are simply directory names in the pseudo root.
7.5. Filehandle Volatility
The nature of the server's pseudo filesystem is that it is a logical
representation of filesystem(s) available from the server.
Therefore, the pseudo filesystem is most likely constructed
dynamically when the server is first instantiated. It is expected
that the pseudo filesystem may not have an on disk counterpart from
which persistent filehandles could be constructed. Even though it is
preferable that the server provide persistent filehandles for the
pseudo filesystem, the NFS client should expect that pseudo file
system filehandles are volatile. This can be confirmed by checking
the associated "fh_expire_type" attribute for those filehandles in
question. If the filehandles are volatile, the NFS client must be
prepared to recover a filehandle value (e.g., with a multi-component
LOOKUP) when receiving an error of NFS4ERR_FHEXPIRED.
7.6. Exported Root
If the server's root filesystem is exported, one might conclude that
a pseudo-filesystem is not needed. This would be wrong. Assume the
following filesystems on a server:
/ disk1 (exported)
/a disk2 (not exported)
/a/b disk3 (exported)
Because disk2 is not exported, disk3 cannot be reached with simple
LOOKUPs. The server must bridge the gap with a pseudo-filesystem.
7.7. Mount Point Crossing
The server filesystem environment may be constructed in such a way
that one filesystem contains a directory which is 'covered' or
mounted upon by a second filesystem. For example:
/a/b (filesystem 1)
/a/b/c/d (filesystem 2)
The pseudo filesystem for this server may be constructed to look
like:
/ (place holder/not exported)
/a/b (filesystem 1)
/a/b/c/d (filesystem 2)
It is the server's responsibility to present a complete pseudo
filesystem to the client. If the client sends a lookup request
for the path "/a/b/c/d", the server's response is the filehandle of
the filesystem "/a/b/c/d". In previous versions of the NFS protocol,
the server would respond with the filehandle of directory "/a/b/c/d"
within the filesystem "/a/b".
The NFS client will be able to determine if it crosses a server mount
point by a change in the value of the "fsid" attribute.
7.8. Security Policy and Name Space Presentation
The application of the server's security policy needs to be carefully
considered by the implementor. One may choose to limit the
viewability of portions of the pseudo filesystem based on the
server's perception of the client's ability to authenticate itself
properly. However, with the support of multiple security mechanisms
and the ability to negotiate the appropriate use of these mechanisms,
the server is unable to properly determine if a client will be able
to authenticate itself. If, based on its policies, the server
chooses to limit the contents of the pseudo filesystem, the server
may effectively hide filesystems from a client that may otherwise
have legitimate access.
As suggested practice, the server should apply the security policy of
a shared resource in the server's namespace to the components of the
resource's ancestors. For example:
/
/a/b
/a/b/c
The /a/b/c directory is a real filesystem and is the shared resource.
The security policy for /a/b/c is Kerberos with integrity. The
server should apply the same security policy to /, /a, and /a/b.
This allows for the extension of the protection of the server's
namespace to the ancestors of the real shared resource.
For the case of the use of multiple, disjoint security mechanisms in
the server's resources, the security for a particular object in the
server's namespace should be the union of all security mechanisms of
all direct descendants.
8. File Locking and Share Reservations
Integrating locking into the NFS protocol necessarily causes it to be
stateful. With the inclusion of share reservations the protocol
becomes substantially more dependent on state than the traditional
combination of NFS and NLM [26]. There are three components to
making this state manageable:
o Clear division between client and server
o Ability to reliably detect inconsistency in state between client
and server
o Simple and robust recovery mechanisms
In this model, the server owns the state information. The client
communicates its view of this state to the server as needed. The
client is also able to detect inconsistent state before modifying a
file.
To support Win32 share reservations it is necessary to atomically
OPEN or CREATE files. Having a separate share/unshare operation
would not allow correct implementation of the Win32 OpenFile API. In
order to correctly implement share semantics, the previous NFS
protocol mechanisms used when a file is opened or created (LOOKUP,
CREATE, ACCESS) need to be replaced. The NFS version 4 protocol has
an OPEN operation that subsumes the NFS version 3 methodology of
LOOKUP, CREATE, and ACCESS. However, because many operations require
a filehandle, the traditional LOOKUP is preserved to map a file name
to filehandle without establishing state on the server. The policy
of granting access or modifying files is managed by the server based
on the client's state. These mechanisms can implement policy ranging
from advisory only locking to full mandatory locking.
8.1. Locking
It is assumed that manipulating a lock is rare when compared to READ
and WRITE operations. It is also assumed that crashes and network
partitions are relatively rare. Therefore it is important that the
READ and WRITE operations have a lightweight mechanism to indicate if
they possess a held lock. A lock request contains the heavyweight
information required to establish a lock and uniquely define the lock
owner.
The following sections describe the transition from the heavyweight
information to the eventual stateid used for most client and server
locking and lease interactions.
8.1.1. Client ID
For each LOCK request, the client must identify itself to the server.
This is done in such a way as to allow for correct lock
identification and crash recovery. A sequence of a SETCLIENTID
operation followed by a SETCLIENTID_CONFIRM operation is required to
establish this identification with the server. Establishment of
identification by a new incarnation of the client also has the effect
of immediately breaking any leased state that a previous incarnation
of the client might have had on the server, as opposed to forcing the
new client incarnation to wait for the leases to expire. Breaking
the lease state amounts to the server removing all lock, share
reservation, and, where the server is not supporting the
CLAIM_DELEGATE_PREV claim type, all delegation state associated with
the same client identity. For discussion of delegation
state recovery, see Section 9.2.1 "Delegation Recovery".
Client identification is encapsulated in the following structure:
struct SETCLIENTID4args {
nfs_client_id4 client;
cb_client4 callback;
uint32_t callback_ident;
};
The client field is an nfs_client_id4 structure containing two
sub-fields: a verifier and an id. The verifier is a client
incarnation verifier that is used to detect client reboots. Only if
the verifier is different from that which the server has previously
recorded for the client (as identified by the id sub-field) does the
server start the process of canceling the client's leased state.
The id sub-field is a variable length string that uniquely defines
the client.
There are several considerations for how the client generates the id
string:
o The string should be unique so that multiple clients do not
present the same string. The consequences of two clients
presenting the same string range from one client getting an error
to one client having its leased state abruptly and unexpectedly
canceled.
o The string should be selected so the subsequent incarnations
(e.g., reboots) of the same client cause the client to present the
same string. The implementor is cautioned against an approach
that requires the string to be recorded in a local file because
this precludes the use of the implementation in an environment
where there is no local disk and all file access is from an NFS
version 4 server.
o The string should be different for each server network address
that the client accesses, rather than common to all server network
addresses. The reason is that it may not be possible for the
client to tell if the same server is listening on multiple network
addresses. If the client issues SETCLIENTID with the same id
string to each network address of such a server, the server will
think it is the same client, and each successive SETCLIENTID will
cause the server to begin the process of removing the client's
previous leased state.
o The algorithm for generating the string should not assume that the
client's network address won't change. This includes changes
between client incarnations and even changes while the client is
still running in its current incarnation. This means that if
the client includes just the client's and server's network address
in the id string, there is a real risk, after the client gives up
the network address, that another client, using a similar
algorithm for generating the id string, will generate a
conflicting id string.
Given the above considerations, an example of a well-generated id
string is one that includes the following (a non-normative
construction sketch in C follows this list):
o The server's network address.
o The client's network address.
o For a user level NFS version 4 client, it should contain
additional information to distinguish the client from other user
level clients running on the same host, such as a process id or
other unique sequence.
o Additional information that tends to be unique, such as one or
more of:
* The client machine's serial number (for privacy reasons, it is
best to perform some one way function on the serial number).
* A MAC address.
* The timestamp of when the NFS version 4 software was first
installed on the client (though this is subject to the
previously mentioned caution about using information that is
stored in a file, because the file might only be accessible
over NFS version 4).
* A true random number. However, since this number ought to be
the same between client incarnations, this shares the same
problem as that of using the timestamp of the software
installation.
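A non-normative sketch of such a construction for a user-level client
follows; client_addr, server_addr, and install_token are hypothetical
inputs obtained elsewhere, and the process id distinguishes multiple
user-level clients on the same host.
#include <stdio.h>
#include <unistd.h>
/* Sketch: build an id string from the elements listed above. */
int make_client_id(char *buf, size_t len,
                   const char *client_addr, const char *server_addr,
                   const char *install_token)
{
    return snprintf(buf, len, "%s/%s/%ld/%s",
                    client_addr, server_addr,
                    (long)getpid(), install_token);
}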
As a security measure, the server MUST NOT cancel a client's leased
state if the principal that established the state for a given id string is
not the same as the principal issuing the SETCLIENTID.
Note that SETCLIENTID and SETCLIENTID_CONFIRM have a secondary purpose
of establishing the information the server needs to make callbacks to
the client for purpose of supporting delegations. It is permitted to
change this information via SETCLIENTID and SETCLIENTID_CONFIRM
within the same incarnation of the client without removing the
client's leased state.
Once a SETCLIENTID and SETCLIENTID_CONFIRM sequence has successfully
completed, the client uses the shorthand client identifier, of type
clientid4, instead of the longer and less compact nfs_client_id4
structure. This shorthand client identifier (a clientid) is assigned
by the server and should be chosen so that it will not conflict with
a clientid previously assigned by the server. This applies across
server restarts or reboots. When a clientid is presented to a server
and that clientid is not recognized, as would happen after a server
reboot, the server will reject the request with the error
NFS4ERR_STALE_CLIENTID. When this happens, the client must obtain a
new clientid by use of the SETCLIENTID operation and then proceed to
any other necessary recovery for the server reboot case (See
Section 8.6.2 "Server Failure and Recovery").
The client must also employ the SETCLIENTID operation when it
receives a NFS4ERR_STALE_STATEID error using a stateid derived from
its current clientid, since this also indicates a server reboot which
has invalidated the existing clientid (see Section 8.1.3 "lock_owner
and stateid Definition" for details).
See the detailed descriptions of SETCLIENTID and SETCLIENTID_CONFIRM
for a complete specification of the operations.
8.1.2. Server Release of Clientid
If the server determines that the client holds no associated state
for its clientid, the server may choose to release the clientid. The
server may make this choice for an inactive client so that resources
are not consumed by those intermittently active clients. If the
client contacts the server after this release, the server must ensure
the client receives the appropriate error so that it will use the
SETCLIENTID/SETCLIENTID_CONFIRM sequence to establish a new identity.
It should be clear that the server must be very hesitant to release a
clientid since the resulting work on the client to recover from such
an event will be the same burden as if the server had failed and
restarted. Typically a server would not release a clientid unless
there had been no activity from that client for many minutes.
Note that if the id string in a SETCLIENTID request is properly
constructed, and if the client takes care to use the same principal
for each successive use of SETCLIENTID, then, barring an active
denial of service attack, NFS4ERR_CLID_INUSE should never be
returned.
However, client bugs, server bugs, or perhaps a deliberate change of
the principal owner of the id string (such as the case of a client
that changes security flavors, and under the new flavor, there is no
mapping to the previous owner) will in rare cases result in
NFS4ERR_CLID_INUSE.
In that event, when the server gets a SETCLIENTID for a client id
that currently has no state, or it has state, but the lease has
expired, rather than returning NFS4ERR_CLID_INUSE, the server MUST
allow the SETCLIENTID, and confirm the new clientid if followed by
the appropriate SETCLIENTID_CONFIRM.
8.1.3. lock_owner and stateid Definition
When requesting a lock, the client must present to the server the
clientid and an identifier for the owner of the requested lock.
These two fields are referred to as the lock_owner, and those fields
are defined as follows:
o A clientid returned by the server as part of the client's use of
the SETCLIENTID operation.
o A variable length opaque array used to uniquely define the owner
of a lock managed by the client.
This may be a thread id, process id, or other unique value.
When the server grants the lock, it responds with a unique stateid.
The stateid is used as a shorthand reference to the lock_owner, since
the server will be maintaining the correspondence between them.
The server is free to form the stateid in any manner that it chooses
as long as it is able to recognize invalid and out-of-date stateids.
This requirement includes those stateids generated by earlier
instances of the server. From this, the client can be properly
notified of a server restart. This notification will occur when the
client presents a stateid to the server from a previous
instantiation.
The server must be able to distinguish the following situations and
return the error as specified:
o The stateid was generated by an earlier server instance (i.e.,
before a server reboot). The error NFS4ERR_STALE_STATEID should
be returned.
o The stateid was generated by the current server instance but the
stateid no longer designates the current locking state for the
lockowner-file pair in question (i.e., one or more locking
operations has occurred). The error NFS4ERR_OLD_STATEID should be
returned.
This error condition will only occur when the client issues a
locking request which changes a stateid while an I/O request that
uses that stateid is outstanding.
o The stateid was generated by the current server instance but the
stateid does not designate a locking state for any active
lockowner-file pair. The error NFS4ERR_BAD_STATEID should be
returned.
This error condition will occur when there has been a logic error
on the part of the client or server. This should not happen.
One mechanism that may be used to satisfy these requirements is for
the server to:
o divide the "other" field of each stateid into two fields:
* A server verifier which uniquely designates a particular server
instantiation.
* An index into a table of locking-state structures.
o utilize the "seqid" field of each stateid, such that seqid is
monotonically incremented for each stateid that is associated with
the same index into the locking-state table.
By matching the incoming stateid and its field values with the state
held at the server, the server is able to easily determine if a
stateid is valid for its current instantiation and state. If the
stateid is not valid, the appropriate error can be supplied to the
client.
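One non-normative realization of this mechanism is sketched below.
The layout of the stateid's 12-byte "other" field and the helper
structure are purely illustrative server-internal choices; the
nfsstat4 error constants are assumed to come from the protocol's XDR
definition.
#include <stdint.h>
/* Sketch: one possible internal layout of the stateid "other" field
 * (12 bytes): a boot verifier for this server instantiation plus an
 * index into a table of locking-state structures. */
struct stateid_other_layout {
    uint32_t boot_verifier;   /* identifies this server instance   */
    uint32_t table_index;     /* slot in the locking-state table   */
    uint32_t reserved;        /* unused; keeps "other" at 12 bytes */
};
/* Sketch: classify an incoming stateid against current server state.
 * slot_in_use and slot_seqid describe the locking-state table entry
 * selected by table_index. */
int classify_stateid(const struct stateid_other_layout *o,
                     uint32_t seqid, uint32_t current_boot_verifier,
                     int slot_in_use, uint32_t slot_seqid)
{
    if (o->boot_verifier != current_boot_verifier)
        return NFS4ERR_STALE_STATEID;   /* earlier server instance  */
    if (!slot_in_use)
        return NFS4ERR_BAD_STATEID;     /* no such locking state    */
    if (seqid != slot_seqid)
        return NFS4ERR_OLD_STATEID;     /* superseded by later ops  */
    return NFS4_OK;
}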
8.1.4. Use of the stateid and Locking
All READ, WRITE and SETATTR operations contain a stateid. For the
purposes of this section, SETATTR operations which change the size
attribute of a file are treated as if they are writing the area
between the old and new size (i.e., the range truncated or added to
the file by means of the SETATTR), even where SETATTR is not
explicitly mentioned in the text.
If the lock_owner performs a READ or WRITE in a situation in which it
has established a lock or share reservation on the server (any OPEN
constitutes a share reservation) the stateid (previously returned by
the server) must be used to indicate what locks, including both
record locks and share reservations, are held by the lockowner. If
no state is established by the client, either record lock or share
reservation, a stateid of all bits 0 is used. Regardless of whether a
stateid of all bits 0, or a stateid returned by the server is used,
if there is a conflicting share reservation or mandatory record lock
held on the file, the server MUST refuse to service the READ or WRITE
operation.
Share reservations are established by OPEN operations and by their
nature are mandatory in that when the OPEN denies READ or WRITE
operations, that denial results in such operations being rejected
with error NFS4ERR_LOCKED. Record locks may be implemented by the
server as either mandatory or advisory, or the choice of mandatory or
advisory behavior may be determined by the server on the basis of the
file being accessed (for example, some UNIX-based servers support a
"mandatory lock bit" on the mode attribute such that if set, record
locks are required on the file before I/O is possible). When record
locks are advisory, they only prevent the granting of conflicting
lock requests and have no effect on READs or WRITEs. Mandatory
record locks, however, prevent conflicting I/O operations. When they
are attempted, they are rejected with NFS4ERR_LOCKED. When the
client gets NFS4ERR_LOCKED on a file it knows it has the proper share
reservation for, it will need to issue a LOCK request on the region
of the file that includes the region the I/O was to be performed on,
with an appropriate locktype (i.e., READ*_LT for a READ operation,
WRITE*_LT for a WRITE operation).
With NFS version 3, there was no notion of a stateid so there was no
way to tell if the application process of the client sending the READ
or WRITE operation had also acquired the appropriate record lock on
the file. Thus there was no way to implement mandatory locking.
With the stateid construct, this barrier has been removed.
Note that for UNIX environments that support mandatory file locking,
the distinction between advisory and mandatory locking is subtle. In
fact, advisory and mandatory record locks are exactly the same in so
far as the APIs and requirements on implementation. If the mandatory
lock attribute is set on the file, the server checks to see if the
lockowner has an appropriate shared (read) or exclusive (write)
record lock on the region it wishes to read or write to. If there is
no appropriate lock, the server checks if there is a conflicting lock
(which can be done by attempting to acquire the conflicting lock on
the behalf of the lockowner, and if successful, release the lock
after the READ or WRITE is done), and if there is, the server returns
NFS4ERR_LOCKED.
For Windows environments, there are no advisory record locks, so the
server always checks for record locks during I/O requests.
Thus, the NFS version 4 LOCK operation does not need to distinguish
between advisory and mandatory record locks. It is the NFS version 4
server's processing of the READ and WRITE operations that introduces
the distinction.
Every stateid other than the special stateid values noted in this
section, whether returned by an OPEN-type operation (i.e., OPEN,
OPEN_DOWNGRADE), or by a LOCK-type operation (i.e., LOCK or LOCKU),
defines an access mode for the file (i.e., READ, WRITE, or READ-
WRITE) as established by the original OPEN which began the stateid
sequence, and as modified by subsequent OPENs and OPEN_DOWNGRADEs
within that stateid sequence. When a READ, WRITE, or SETATTR which
specifies the size attribute, is done, the operation is subject to
checking against the access mode to verify that the operation is
appropriate given the OPEN with which the operation is associated.
In the case of WRITE-type operations (i.e., WRITEs and SETATTRs which
set size), the server must verify that the access mode allows writing
and return an NFS4ERR_OPENMODE error if it does not. In the case of
READ, the server may perform the corresponding check on the access
mode, or it may choose to allow READ on opens for WRITE only, to
accommodate clients whose write implementation may unavoidably do
reads (e.g., due to buffer cache constraints). However, even if
READs are allowed in these circumstances, the server MUST still check
for locks that conflict with the READ (e.g., another OPEN specifies
denial of READs). Note that a server which does enforce the access
mode check on READs need not explicitly check for conflicting share
reservations since the existence of OPEN for read access guarantees
that no conflicting share reservation can exist.
A stateid of all bits 1 (one) MAY allow READ operations to bypass
locking checks at the server. However, WRITE operations with a
stateid with bits all 1 (one) MUST NOT bypass locking checks and are
treated exactly the same as if a stateid of all bits 0 were used.
A lock may not be granted while a READ or WRITE operation using one
of the special stateids is being performed and the range of the lock
request conflicts with the range of the READ or WRITE operation. For
the purposes of this paragraph, a conflict occurs when a shared lock
is requested and a WRITE operation is being performed, or an
exclusive lock is requested and either a READ or a WRITE operation is
being performed. A SETATTR that sets size is treated similarly to a
WRITE as discussed above.
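A non-normative summary of the WRITE-side checks described in this
section is sketched below; the boolean inputs stand in for
server-internal state lookups, and the nfsstat4 constants are assumed
to come from the protocol's XDR definition.
/* Sketch: checks applied to a WRITE (or size-setting SETATTR).  A
 * special stateid (all bits 0 or all bits 1) gets no bypass for
 * WRITE; conflicting share reservations and mandatory record locks
 * always cause rejection. */
int check_write(int stateid_is_special, int open_mode_allows_write,
                int conflicting_share_or_mandatory_lock)
{
    if (conflicting_share_or_mandatory_lock)
        return NFS4ERR_LOCKED;
    if (!stateid_is_special && !open_mode_allows_write)
        return NFS4ERR_OPENMODE;
    return NFS4_OK;
}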
8.1.5. Sequencing of Lock Requests
Locking is different from most NFS operations as it requires "at-
most-one" semantics that are not provided by ONCRPC. ONCRPC over a
reliable transport is not sufficient because a sequence of locking
requests may span multiple TCP connections. In the face of
retransmission or reordering, lock or unlock requests must have a
well defined and consistent behavior. To accomplish this, each lock
request contains a sequence number that is a consecutively increasing
integer. Different lock_owners have different sequences. The server
maintains the last sequence number (L) received and the response that
was returned. The first request issued for any given lock_owner is
issued with a sequence number of zero.
Note that for requests that contain a sequence number, for each
lock_owner, there should be no more than one outstanding request.
If a request (r) with a previous sequence number (r < L) is received,
it is rejected with the return of error NFS4ERR_BAD_SEQID. Given a
properly-functioning client, the response to (r) must have been
received before the last request (L) was sent. If a duplicate of
last request (r == L) is received, the stored response is returned.
If a request beyond the next sequence (r == L + 2) is received, it is
rejected with the return of error NFS4ERR_BAD_SEQID. Sequence
history is reinitialized whenever the SETCLIENTID/SETCLIENTID_CONFIRM
sequence changes the client verifier.
Since the sequence number is represented with an unsigned 32-bit
integer, the arithmetic involved with the sequence number is mod
2^32. For an example of modulo arithmetic involving sequence numbers
see [27].
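The per-lock_owner sequencing rules can be summarized with the
following non-normative sketch; unsigned 32-bit arithmetic in C
provides the mod 2^32 behavior described above.
#include <stdint.h>
enum seqid_disposition {
    SEQID_REPLAY,    /* duplicate of last request: return cached reply */
    SEQID_PROCESS,   /* expected next request: process it              */
    SEQID_BAD        /* anything else: return NFS4ERR_BAD_SEQID        */
};
/* Sketch: classify an incoming sequence number r against the last
 * sequence number (last) received for this lock_owner. */
enum seqid_disposition classify_seqid(uint32_t r, uint32_t last)
{
    if (r == last)
        return SEQID_REPLAY;
    if (r == last + 1u)        /* wraps mod 2^32 automatically */
        return SEQID_PROCESS;
    return SEQID_BAD;
}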
It is critical that the server maintain the last response sent to the
client to provide a more reliable cache of duplicate non-idempotent
requests than that of the traditional cache described in [28]. The
traditional duplicate request cache uses a least recently used
algorithm for removing unneeded requests. However, the last lock
request and response on a given lock_owner must be cached as long as
the lock state exists on the server.
The client MUST monotonically increment the sequence number for the
CLOSE, LOCK, LOCKU, OPEN, OPEN_CONFIRM, and OPEN_DOWNGRADE
operations. This is true even in the event that the previous
operation that used the sequence number received an error. The only
exception to this rule is if the previous operation received one of
the following errors: NFS4ERR_STALE_CLIENTID, NFS4ERR_STALE_STATEID,
NFS4ERR_BAD_STATEID, NFS4ERR_BAD_SEQID, NFS4ERR_BADXDR,
NFS4ERR_RESOURCE, NFS4ERR_NOFILEHANDLE.
8.1.6. Recovery from Replayed Requests
As described above, the sequence number is per lock_owner. As long
as the server maintains the last sequence number received and follows
the methods described above, there are no risks of a Byzantine router
re-sending old requests. The server need only maintain the
(lock_owner, sequence number) state as long as there are open files
or closed files with locks outstanding.
LOCK, LOCKU, OPEN, OPEN_DOWNGRADE, and CLOSE each contain a sequence
number and therefore the risk of the replay of these operations
resulting in undesired effects is non-existent while the server
maintains the lock_owner state.
8.1.7. Releasing lock_owner State
When a particular lock_owner no longer holds open or file locking
state at the server, the server may choose to release the sequence
number state associated with the lock_owner. The server may make
this choice based on lease expiration, for the reclamation of server
memory, or other implementation specific details. In any event, the
server is able to do this safely only when the lock_owner is no
longer being utilized by the client. The server may choose to hold
the
lock_owner state in the event that retransmitted requests are
received. However, the period to hold this state is implementation
specific.
In the case that a LOCK, LOCKU, OPEN_DOWNGRADE, or CLOSE is
retransmitted after the server has previously released the lock_owner
state, the server will find that the lock_owner has no files open and
an error will be returned to the client. If the lock_owner does have
a file open, the stateid will not match and again an error is
returned to the client.
8.1.8. Use of Open Confirmation
In the case that an OPEN is retransmitted and the lock_owner is being
used for the first time or the lock_owner state has been previously
released by the server, the use of the OPEN_CONFIRM operation will
prevent incorrect behavior. When the server observes the use of the
lock_owner for the first time, it will direct the client to perform
the OPEN_CONFIRM for the corresponding OPEN. This sequence
establishes the use of a lock_owner and associated sequence number.
Since the OPEN_CONFIRM sequence connects a new open_owner on the
server with an existing open_owner on a client, the sequence number
may have any value. The OPEN_CONFIRM step assures the server that
the value received is the correct one. See Section 14.20
"OPEN_CONFIRM - Confirm Open" for further details.
There are a number of situations in which the requirement to confirm
an OPEN would pose difficulties for the client and server, in that
they would be prevented from acting in a timely fashion on
information received, because that information would be provisional,
subject to deletion upon non-confirmation. Fortunately, these are
situations in which the server can avoid the need for confirmation
when responding to open requests. The two constraints are:
o The server must not bestow a delegation for any open which would
require confirmation.
o The server MUST NOT require confirmation on a reclaim-type open
(i.e., one specifying claim type CLAIM_PREVIOUS or
CLAIM_DELEGATE_PREV).
These constraints are related in that reclaim-type opens are the only
ones in which the server may be required to send a delegation. For
CLAIM_NULL, sending the delegation is optional while for
CLAIM_DELEGATE_CUR, no delegation is sent.
Delegations being sent with an open requiring confirmation are
troublesome because recovering from non-confirmation adds undue
complexity to the protocol, while requiring confirmation on reclaim-
type opens poses difficulties in that the inability to resolve the
status of the reclaim until lease expiration may make it difficult to
have timely determination of the set of locks being reclaimed (since
the grace period may expire).
Requiring open confirmation on reclaim-type opens is avoidable
because of the nature of the environments in which such opens are
done. For CLAIM_PREVIOUS opens, this is immediately after server
reboot, so there should be no time for lockowners to be created,
found to be unused, and recycled. For CLAIM_DELEGATE_PREV opens, we
are dealing with a client reboot situation. A server which supports
delegation can be sure that no lockowners for that client have been
recycled since client initialization and thus can ensure that
confirmation will not be required.
8.2. Lock Ranges
The protocol allows a lock owner to request a lock with a byte range
and then either upgrade or unlock a sub-range of the initial lock.
It is expected that this will be an uncommon type of request. In any
case, servers or server filesystems may not be able to support sub-
range lock semantics. In the event that a server receives a locking
request that represents a sub-range of current locking state for the
lock owner, the server is allowed to return the error
NFS4ERR_LOCK_RANGE to signify that it does not support sub-range lock
operations. Therefore, the client should be prepared to receive this
error and, if appropriate, report the error to the requesting
application.
The client is discouraged from combining multiple independent locking
ranges that happen to be adjacent into a single request since the
server may not support sub-range requests and for reasons related to
the recovery of file locking state in the event of server failure.
As discussed in the section "Server Failure and Recovery" below, the
server may employ certain optimizations during recovery that work
effectively only when the client's behavior during lock recovery is
similar to the client's locking behavior prior to server failure.
8.3. Upgrading and Downgrading Locks
If a client has a write lock on a record, it can request an atomic
downgrade of the lock to a read lock via the LOCK request, by setting
the type to READ_LT. If the server supports atomic downgrade, the
request will succeed. If not, it will return NFS4ERR_LOCK_NOTSUPP.
The client should be prepared to receive this error, and if
appropriate, report the error to the requesting application.
If a client has a read lock on a record, it can request an atomic
upgrade of the lock to a write lock via the LOCK request by setting
the type to WRITE_LT or WRITEW_LT. If the server does not support
atomic upgrade, it will return NFS4ERR_LOCK_NOTSUPP. If the upgrade
can be achieved without an existing conflict, the request will
succeed. Otherwise, the server will return either NFS4ERR_DENIED or
NFS4ERR_DEADLOCK. The error NFS4ERR_DEADLOCK is returned if the
client issued the LOCK request with the type set to WRITEW_LT and the
server has detected a deadlock. The client should be prepared to
receive such errors and if appropriate, report the error to the
requesting application.
8.4. Blocking Locks
Some clients require the support of blocking locks. The NFS version
4 protocol must not rely on a callback mechanism and therefore is
unable to notify a client when a previously denied lock has been
granted. Clients have no choice but to continually poll for the
lock. This presents a fairness problem. Two new lock types are
added, READW and WRITEW, and are used to indicate to the server that
the client is requesting a blocking lock. The server should maintain
an ordered list of pending blocking locks. When the conflicting lock
is released, the server may wait the lease period for the first
waiting client to re-request the lock. After the lease period
expires the next waiting client request is allowed the lock. Clients
are required to poll at an interval sufficiently small that it is
likely to acquire the lock in a timely manner. The server is not
required to maintain a list of pending blocked locks, as such a list
is used to increase fairness and not correctness of operation.
Because of the
unordered nature of crash recovery, storing of lock state to stable
storage would be required to guarantee ordered granting of blocking
locks.
Servers may also note the lock types and delay returning denial of
the request to allow extra time for a conflicting lock to be
released, allowing a successful return. In this way, clients can
avoid the burden of needlessly frequent polling for blocking locks.
The server should take care in the length of delay in the event the
client retransmits the request.
8.5. Lease Renewal
The purpose of a lease is to allow a server to remove stale locks
that are held by a client that has crashed or is otherwise
unreachable. It is not a mechanism for cache consistency and lease
renewals may not be denied if the lease interval has not expired.
The following events cause implicit renewal of all of the leases for
a given client (i.e., all those sharing a given clientid). Each of
these is a positive indication that the client is still active and
that the associated state held at the server, for the client, is
still valid.
o An OPEN with a valid clientid.
o Any operation made with a valid stateid (CLOSE, DELEGPURGE,
DELEGRETURN, LOCK, LOCKU, OPEN, OPEN_CONFIRM, OPEN_DOWNGRADE,
READ, RENEW, SETATTR, WRITE). This does not include the special
stateids of all bits 0 or all bits 1.
Note that if the client had restarted or rebooted, the client
would not be making these requests without issuing the
SETCLIENTID/SETCLIENTID_CONFIRM sequence. The use of the
SETCLIENTID/SETCLIENTID_CONFIRM sequence (one that changes the
client verifier) notifies the server to drop the locking state
associated with the client. SETCLIENTID/SETCLIENTID_CONFIRM never
renews a lease.
If the server has rebooted, the stateids (NFS4ERR_STALE_STATEID
error) or the clientid (NFS4ERR_STALE_CLIENTID error) will not be
valid hence preventing spurious renewals.
This approach allows for low overhead lease renewal which scales
well. In the typical case no extra RPC calls are required for lease
renewal and in the worst case one RPC is required every lease period
(i.e., a RENEW operation). The number of locks held by the client is
not a factor since all state for the client is involved with the
lease renewal action.
Since all operations that create a new lease also renew existing
leases, the server must maintain a common lease expiration time for
all valid leases for a given client. This lease time can then be
easily updated upon implicit lease renewal actions.
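Maintaining a single expiration time per client makes implicit
renewal a constant-time update, independent of the number of locks
held, as the following non-normative sketch illustrates.
#include <time.h>
/* Sketch: per-client lease state with a common expiration time. */
struct client_lease {
    time_t expires;   /* covers all state held for this clientid */
};
/* Called for any operation that implicitly or explicitly renews the
 * lease (e.g., a RENEW or an operation with a valid stateid). */
void renew_lease(struct client_lease *c, time_t lease_period)
{
    c->expires = time(NULL) + lease_period;
}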
8.6. Crash Recovery
The important requirement in crash recovery is that both the client
and the server know when the other has failed. Additionally, it is
required that a client see a consistent view of data across server
restarts or reboots. All READ and WRITE operations that may have
been queued within the client or network buffers must wait until the
client has successfully recovered the locks protecting the READ and
WRITE operations.
8.6.1. Client Failure and Recovery
In the event that a client fails, the server may recover the client's
locks when the associated leases have expired. Conflicting locks
from another client may only be granted after this lease expiration.
If the client is able to restart or reinitialize within the lease
period the client may be forced to wait the remainder of the lease
period before obtaining new locks.
To minimize client delay upon restart, lock requests are associated
with an instance of the client by a client supplied verifier. This
verifier is part of the initial SETCLIENTID call made by the client.
The server returns a clientid as a result of the SETCLIENTID
operation. The client then confirms the use of the clientid with
SETCLIENTID_CONFIRM. The clientid in combination with an opaque
owner field is then used by the client to identify the lock owner for
OPEN. This chain of associations is then used to identify all locks
for a particular client.
Since the verifier will be changed by the client upon each
initialization, the server can compare a new verifier to the verifier
associated with currently held locks and determine that they do not
match. This signifies the client's new instantiation and subsequent
loss of locking state. As a result, the server is free to release
all locks held which are associated with the old clientid which was
derived from the old verifier.
Note that the verifier must have the same uniqueness properties as
the verifier for the COMMIT operation.
8.6.2. Server Failure and Recovery
If the server loses locking state (usually as a result of a restart
or reboot), it must allow clients time to discover this fact and re-
establish the lost locking state. The client must be able to re-
establish the locking state without having the server deny valid
requests because the server has granted conflicting access to another
client. Likewise, if there is the possibility that clients have not
yet re-established their locking state for a file, the server must
disallow READ and WRITE operations for that file. The duration of
this recovery period is equal to the duration of the lease period.
A client can determine that server failure (and thus loss of locking
state) has occurred, when it receives one of two errors. The
NFS4ERR_STALE_STATEID error indicates a stateid invalidated by a
reboot or restart. The NFS4ERR_STALE_CLIENTID error indicates a
clientid invalidated by reboot or restart. When either of these is
received, the client must establish a new clientid (See Section 8.1.1
"Client ID") and re-establish the locking state as discussed below.
The period of special handling of locking and READs and WRITEs, equal
in duration to the lease period, is referred to as the "grace
period". During the grace period, clients recover locks and the
associated state by reclaim-type locking requests (i.e., LOCK
requests with reclaim set to true and OPEN operations with a claim
type of CLAIM_PREVIOUS). During the grace period, the server must
reject READ and WRITE operations and non-reclaim locking requests
(i.e., other LOCK and OPEN operations) with an error of
NFS4ERR_GRACE.
If the server can reliably determine that granting a non-reclaim
request will not conflict with reclamation of locks by other clients,
the NFS4ERR_GRACE error does not have to be returned and the non-
reclaim client request can be serviced. For the server to be able to
service READ and WRITE operations during the grace period, it must
again be able to guarantee that no possible conflict could arise
between an impending reclaim locking request and the READ or WRITE
operation. If the server is unable to offer that guarantee, the
NFS4ERR_GRACE error must be returned to the client.
For a server to provide simple, valid handling during the grace
period, the easiest method is to simply reject all non-reclaim
locking requests and READ and WRITE operations by returning the
NFS4ERR_GRACE error. However, a server may keep information about
granted locks in stable storage. With this information, the server
could determine if a regular lock or READ or WRITE operation can be
safely processed.
For example, if a count of locks on a given file is available in
stable storage, the server can track reclaimed locks for the file and
when all reclaims have been processed, non-reclaim locking requests
may be processed. This way the server can ensure that non-reclaim
locking requests will not conflict with potential reclaim requests.
With respect to I/O requests, if the server is able to determine that
there are no outstanding reclaim requests for a file by information
from stable storage or another similar mechanism, the processing of
I/O requests could proceed normally for the file.
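A non-normative sketch of gating non-reclaim requests with such a
per-file count follows; the structure contents are assumed to have
been recorded in stable storage before the reboot.
/* Sketch: per-file reclaim tracking during the grace period. */
struct file_reclaim_state {
    unsigned int locks_at_reboot;    /* count from stable storage      */
    unsigned int locks_reclaimed;    /* reclaims processed since boot  */
};
/* May a non-reclaim lock or I/O request on this file be serviced
 * before the grace period ends?  Only once every lock that existed
 * before the reboot has been reclaimed. */
int non_reclaim_request_ok(const struct file_reclaim_state *f)
{
    return f->locks_reclaimed >= f->locks_at_reboot;
}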
To reiterate, for a server that allows non-reclaim lock and I/O
requests to be processed during the grace period, it MUST determine
that no lock subsequently reclaimed will be rejected and that no lock
subsequently reclaimed would have prevented any I/O operation
processed during the grace period.
Clients should be prepared for the return of NFS4ERR_GRACE errors for
non-reclaim lock and I/O requests. In this case the client should
employ a retry mechanism for the request. A delay (on the order of
several seconds) between retries should be used to avoid overwhelming
the server. Further discussion of the general issue is included in
[19]. The client must account for servers that are able to perform
I/O and non-reclaim locking requests within the grace period as well
as for those that cannot do so.
A reclaim-type locking request outside the server's grace period can
only succeed if the server can guarantee that no conflicting lock or
I/O request has been granted since reboot or restart.
A server may, upon restart, establish a new value for the lease
period. Therefore, clients should, once a new clientid is
established, refetch the lease_time attribute and use it as the basis
for lease renewal for the lease associated with that server.
However, the server must establish, for this restart event, a grace
period at least as long as the lease period for the previous server
instantiation. This allows the client state obtained during the
previous server instance to be reliably re-established.
8.6.3. Network Partitions and Recovery
If the duration of a network partition is greater than the lease
period provided by the server, the server will not have received a
lease renewal from the client. If this occurs, the server may free
all locks held for the client. As a result, all stateids held by the
client will become invalid or stale. Once the client is able to
reach the server after such a network partition, all I/O submitted by
the client with the now invalid stateids will fail with the server
returning the error NFS4ERR_EXPIRED. Once this error is received,
the client will suitably notify the application that held the lock.
As a courtesy to the client or as an optimization, the server may
continue to hold locks on behalf of a client for which recent
communication has extended beyond the lease period. If the server
receives a lock or I/O request that conflicts with one of these
courtesy locks, the server must free the courtesy lock and grant the
new request.
When a network partition is combined with a server reboot, there are
edge conditions that place requirements on the server in order to
avoid silent data corruption following the server reboot. Two of
these edge conditions are known, and are discussed below.
The first edge condition has the following scenario:
1. Client A acquires a lock.
2. Client A and server experience mutual network partition, such
that client A is unable to renew its lease.
3. Client A's lease expires, so server releases lock.
4. Client B acquires a lock that would have conflicted with that of
Client A.
5. Client B releases the lock.
6. Server reboots.
7. Network partition between client A and server heals.
8. Client A issues a RENEW operation, and gets back a
NFS4ERR_STALE_CLIENTID.
9. Client A reclaims its lock within the server's grace period.
Thus, at the final step, the server has erroneously granted client
A's lock reclaim. If client B modified the object the lock was
protecting, client A will experience object corruption.
The second known edge condition follows:
1. Client A acquires a lock.
2. Server reboots.
3. Client A and server experience mutual network partition, such
that client A is unable to reclaim its lock within the grace
period.
4. Server's reclaim grace period ends. Client A has no locks
recorded on server.
5. Client B acquires a lock that would have conflicted with that of
Client A.
6. Client B releases the lock.
7. Server reboots a second time.
8. Network partition between client A and server heals.
9. Client A issues a RENEW operation, and gets back a
NFS4ERR_STALE_CLIENTID.
10. Client A reclaims its lock within the server's grace period.
As with the first edge condition, the final step of the scenario of
the second edge condition has the server erroneously granting client
A's lock reclaim.
Solving the first and second edge conditions requires that the server
either assume, after it reboots, that an edge condition has occurred,
and thus return NFS4ERR_NO_GRACE for all reclaim attempts, or that
the server record some information in stable storage. The amount of
information the server records in stable storage is in inverse
proportion to how harsh the server wants to be whenever the edge
conditions occur. A server that is completely tolerant of all edge
conditions will record in stable storage every lock that is acquired,
removing the lock record from stable storage only when the lock is
unlocked by the client and the lock's lockowner advances the sequence
number such that the lock release is not the last stateful event for
the lockowner's sequence. For the two aforementioned edge conditions,
the harshest a server can be, and still support a grace period for
reclaims, requires that the server record some minimal information in
stable storage. For example, a server implementation could, for each
client, save in stable storage a record containing:
o the client's id string
o a boolean that indicates if the client's lease expired or if there
was administrative intervention (see Section 8.8, "Server
Revocation of Locks") to revoke a record lock, share reservation,
or delegation
o a timestamp that is updated the first time after a server boot or
reboot the client acquires record locking, share reservation, or
delegation state on the server. The timestamp need not be updated
on subsequent lock requests until the server reboots.
The server implementation would also record in stable storage the
timestamps from the two most recent server reboots.
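The following C fragment is a sketch of how such stable storage
records might be represented; the field and structure names are
illustrative assumptions rather than part of the protocol.

   #include <time.h>

   /* Hypothetical stable storage record kept for each client. */
   struct client_recovery_record {
           char   *id_string;         /* the client's id string */
           int     state_revoked;     /* lease expired or locks revoked
                                         by administrative intervention */
           time_t  first_state_time;  /* first acquisition of locking,
                                         share reservation, or delegation
                                         state after the current boot */
   };

   /* Hypothetical record of the two most recent server reboots. */
   struct server_boot_record {
           time_t  previous_boot;
           time_t  boot_before_previous;
   };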
Assuming the above record keeping, for the first edge condition,
after the server reboots, the record that client A's lease expired
means that another client could have acquired a conflicting record
lock, share reservation, or delegation. Hence the server must reject
a reclaim from client A with the error NFS4ERR_NO_GRACE.
For the second edge condition, after the server reboots for a second
time, the record that the client had an unexpired record lock, share
reservation, or delegation established before the server's previous
incarnation means that the server must reject a reclaim from client A
with the error NFS4ERR_NO_GRACE.
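Given records of that kind, the reclaim decision for the two edge
conditions could be expressed as in the following C sketch (all names
are hypothetical):

   #include <time.h>

   /* Returns nonzero if a reclaim from the client must be rejected
      with NFS4ERR_NO_GRACE.  state_revoked and first_state_time come
      from the client's stable storage record; previous_boot is the
      timestamp of the previous server reboot. */
   static int
   must_reject_reclaim(int state_revoked, time_t first_state_time,
                       time_t previous_boot)
   {
           /* First edge condition: the lease expired (or state was
              revoked), so a conflicting lock may have been granted. */
           if (state_revoked)
                   return 1;

           /* Second edge condition: the client's state was established
              before the previous server incarnation and therefore was
              not carried into the current grace period. */
           if (first_state_time < previous_boot)
                   return 1;

           return 0;
   }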
Regardless of the level and approach to record keeping, the server
MUST implement one of the following strategies (which apply to
reclaims of share reservations, record locks, and delegations):
1. Reject all reclaims with NFS4ERR_NO_GRACE. This is superharsh,
but necessary if the server does not want to record lock state in
stable storage.
2. Record sufficient state in stable storage such that all known
edge conditions involving server reboot, including the two noted
in this section, are detected. False positives are acceptable.
Note that at this time, it is not known if there are other edge
conditions. In the event that, after a server reboot, the server
determines that there is unrecoverable damage or corruption to the
stable storage, then, for all clients and/or locks affected, the
server MUST return NFS4ERR_NO_GRACE.
A mandate for the client's handling of the NFS4ERR_NO_GRACE error is
outside the scope of this specification, since the strategies for
such handling are very dependent on the client's operating
environment. However, one potential approach is described below.
When the client receives NFS4ERR_NO_GRACE, it could examine the
change attribute of the objects the client is trying to reclaim state
for, and use that to determine whether to re-establish the state via
normal OPEN or LOCK requests. This is acceptable provided the
client's operating environment allows it. In other words, the client
implementor is advised to document this behavior for users. The
client could also inform the application that its record lock or
share reservations (whether they were delegated or not) have been
lost, such as via a UNIX signal, a GUI pop-up window, etc. See
Section 9.5, "Data Caching and Revocation", for a discussion of how
the client should deal with unreclaimed delegations in client state.
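The following C fragment sketches that potential client approach; the
helper functions (fetch_change_attr, reestablish_state_normally,
notify_application_lock_lost) are hypothetical and stand in for
whatever mechanisms the client's operating environment provides.

   typedef unsigned long long changeid4;

   /* Hypothetical helpers supplied by the client implementation. */
   extern int  fetch_change_attr(const char *path, changeid4 *out);
   extern int  reestablish_state_normally(const char *path);
   extern void notify_application_lock_lost(const char *path);

   /* Called when a reclaim attempt returns NFS4ERR_NO_GRACE. */
   static void
   handle_no_grace(const char *path, changeid4 cached_change)
   {
           changeid4 current;

           if (fetch_change_attr(path, &current) == 0 &&
               current == cached_change) {
                   /* Object apparently unmodified; re-establish state
                      via normal OPEN or LOCK requests, if the client's
                      environment allows it. */
                   (void) reestablish_state_normally(path);
           } else {
                   /* Object changed (or attribute unavailable); inform
                      the application that its locks were lost. */
                   notify_application_lock_lost(path);
           }
   }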
For further discussion of revocation of locks see Section 8.8 "Server
Revocation of Locks".
8.7. Recovery from a Lock Request Timeout or Abort
In the event a lock request times out, a client may decide to not
retry the request. The client may also abort the request when the
process for which it was issued is terminated (e.g., in UNIX due to a
signal). It is possible though that the server received the request
and acted upon it. This would change the state on the server without
the client being aware of the change. It is paramount that the
client re-synchronize state with the server before it attempts any other
operation that takes a seqid and/or a stateid with the same
lock_owner. This is straightforward to do without a special re-
synchronize operation.
Since the server maintains the last lock request and response
received for each lock_owner, the client should, for each lock_owner,
cache the last lock request it sent for which no response was
received. Then, the next time the client performs a lock operation
for that lock_owner, it can send the cached request, if there is one.
If the request was one that established state (e.g., a LOCK or OPEN
operation), the server will return the cached result or, if it never
saw the request, perform it. The client can then follow up with a
request to remove the state (e.g., a LOCKU or CLOSE operation). With
this approach, the sequencing and stateid information on the client
and server for the given lock_owner will re-synchronize and in turn
the lock state will re-synchronize.
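The following C fragment sketches this re-synchronization; the
request cache and helper functions are hypothetical.

   /* Hypothetical per-lock_owner cache of the last lock request for
      which no response was received. */
   struct lock_owner_cache {
           int   have_pending;      /* is there a cached request?   */
           void *pending_request;   /* the cached LOCK/OPEN request */
   };

   extern int resend_request(void *req);    /* replay cached request */
   extern int send_release_for(void *req);  /* e.g., LOCKU or CLOSE  */

   /* Invoked before any other operation that takes a seqid or
      stateid for this lock_owner. */
   static int
   resync_lock_owner(struct lock_owner_cache *lo)
   {
           if (!lo->have_pending)
                   return 0;
           /* The server returns its cached result or, if it never saw
              the request, performs it. */
           if (resend_request(lo->pending_request) != 0)
                   return -1;
           /* Remove any state the replayed request established. */
           if (send_release_for(lo->pending_request) != 0)
                   return -1;
           lo->have_pending = 0;
           return 0;
   }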
8.8. Server Revocation of Locks
At any point, the server can revoke locks held by a client and the
client must be prepared for this event. When the client detects that
its locks have been or may have been revoked, the client is
responsible for validating the state information between itself and
the server. Validating locking state for the client means that it
must verify or reclaim state for each lock currently held.
The first instance of lock revocation is upon server reboot or re-
initialization. In this instance the client will receive an error
(NFS4ERR_STALE_STATEID or NFS4ERR_STALE_CLIENTID) and the client will
proceed with normal crash recovery as described in the previous
section.
The second lock revocation event is the inability to renew the lease
before expiration. While this is considered a rare or unusual event,
the client must be prepared to recover. Both the server and client
will be able to detect the failure to renew the lease and are capable
of recovering without data corruption. The server tracks the
last renewal event serviced for the client and knows when the lease
will expire. Similarly, the client must track operations which will
renew the lease period. Using the time that each such request was
sent and the time that the corresponding reply was received, the
client should bound the time that the corresponding renewal could
have occurred on the server and thus determine if it is possible that
a lease period expiration could have occurred.
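The following C fragment sketches the conservative bound described
above: the renewal can have taken effect on the server no earlier
than the time the renewing request was sent, so expiration becomes
possible once one lease period has elapsed since that send time
(names are illustrative only).

   #include <time.h>

   /* Returns nonzero if the lease may have expired and the client
      must validate the locks held under it. */
   static int
   lease_may_have_expired(time_t last_renewal_sent,
                          unsigned int lease_time_s, time_t now)
   {
           return now > last_renewal_sent + (time_t)lease_time_s;
   }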
The third lock revocation event can occur as a result of
administrative intervention within the lease period. While this is
considered a rare event, it is possible that the server's
administrator has decided to release or revoke a particular lock held
by the client. As a result of revocation, the client will receive an
error of NFS4ERR_ADMIN_REVOKED. In this instance the client may
assume that only the lock_owner's locks have been lost. The client
notifies the lock holder appropriately. The client may not assume
the lease period has been renewed as a result of a failed operation.
When the client determines the lease period may have expired, the
client must mark all locks held for the associated lease as
"unvalidated". This means the client has been unable to re-establish
or confirm the appropriate lock state with the server. As described
in the previous section on crash recovery, there are scenarios in
which the server may grant conflicting locks after the lease period
has expired for a client. When it is possible that the lease period
has expired, the client must validate each lock currently held to
ensure that a conflicting lock has not been granted. The client may
accomplish this task by issuing an I/O request, either a pending I/O
or a zero-length read, specifying the stateid associated with the
lock in question. If the response to the request is success, the
client has validated all of the locks governed by that stateid and
re-established the appropriate state between itself and the server.
If the I/O request is not successful, then one or more of the locks
associated with the stateid was revoked by the server and the client
must notify the owner.
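A sketch of this validation, using a zero-length READ, is given below
in C; the helper functions are hypothetical.

   /* Hypothetical helpers: issue a zero-length READ using the given
      stateid and return its NFSv4 status; notify the lock owner of a
      revocation. */
   extern int  zero_length_read(const char *path, const void *stateid);
   extern void notify_owner_revoked(const void *stateid);

   /* Validate the locks governed by one stateid after a possible
      lease expiration. */
   static int
   validate_stateid(const char *path, const void *stateid)
   {
           if (zero_length_read(path, stateid) == 0)
                   return 0;  /* all locks under this stateid valid */

           /* One or more locks associated with the stateid were
              revoked by the server; the owner must be notified. */
           notify_owner_revoked(stateid);
           return -1;
   }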
8.9. Share Reservations
A share reservation is a mechanism to control access to a file. It
is a separate and independent mechanism from record locking. When a
client opens a file, it issues an OPEN operation to the server
specifying the type of access required (READ, WRITE, or BOTH) and the
type of access to deny others (deny NONE, READ, WRITE, or BOTH). If
the OPEN fails the client will fail the application's open request.
Pseudo-code definition of the semantics:
if (request.access == 0)
return (NFS4ERR_INVAL)
else if ((request.access & file_state.deny) ||
         (request.deny & file_state.access))
return (NFS4ERR_DENIED)
This checking of share reservations on OPEN is done with no exception
for an existing OPEN for the same open_owner.
The constants used for the OPEN and OPEN_DOWNGRADE operations for the
access and deny fields are as follows:
const OPEN4_SHARE_ACCESS_READ = 0x00000001;
const OPEN4_SHARE_ACCESS_WRITE = 0x00000002;
const OPEN4_SHARE_ACCESS_BOTH = 0x00000003;
const OPEN4_SHARE_DENY_NONE = 0x00000000;
const OPEN4_SHARE_DENY_READ = 0x00000001;
const OPEN4_SHARE_DENY_WRITE = 0x00000002;
const OPEN4_SHARE_DENY_BOTH = 0x00000003;
8.10. OPEN/CLOSE Operations
To provide correct share semantics, a client MUST use the OPEN
operation to obtain the initial filehandle and indicate the desired
access and what, if any, access to deny. Even if the client intends to
use a stateid of all 0's or all 1's, it must still obtain the
filehandle for the regular file with the OPEN operation so the
appropriate share semantics can be applied. For clients that do not
have a deny mode built into their open programming interfaces, deny
equal to NONE should be used.
The OPEN operation with the CREATE flag also subsumes the CREATE
operation for regular files as used in previous versions of the NFS
protocol. This allows a create with a share to be done atomically.
The CLOSE operation removes all share reservations held by the
lock_owner on that file. If record locks are held, the client SHOULD
release all locks before issuing a CLOSE. The server MAY free all
outstanding locks on CLOSE but some servers may not support the CLOSE
of a file that still has record locks held. The server MUST return
failure, NFS4ERR_LOCKS_HELD, if any locks would exist after the
CLOSE.
The LOOKUP operation will return a filehandle without establishing
any lock state on the server. Without a valid stateid, the server
will assume the client has the least access. For example, a file
opened with deny READ/WRITE cannot be accessed using a filehandle
obtained through LOOKUP because it would not have a valid stateid
(i.e., using a stateid of all bits 0 or all bits 1).
8.10.1. Close and Retention of State Information
Since a CLOSE operation requests deallocation of a stateid, dealing
with retransmission of the CLOSE may pose special difficulties,
since the state information, which normally would be used to
determine the state of the open file being designated, might be
deallocated, resulting in an NFS4ERR_BAD_STATEID error.
Servers may deal with this problem in a number of ways. To provide
the greatest degree of assurance that the protocol is being used
properly, a server should, rather than deallocate the stateid, mark
it as close-pending, and retain the stateid with this status, until
later deallocation. In this way, a retransmitted CLOSE can be
recognized since the stateid points to state information with this
distinctive status, so that it can be handled without error.
When adopting this strategy, a server should retain the state
information until the earliest of:
o Another validly sequenced request for the same lockowner, that is
not a retransmission.
o The time that a lockowner is freed by the server due to a period
with no activity.
o All locks for the client are freed as a result of a SETCLIENTID.
Servers may avoid this complexity, at the cost of less complete
protocol error checking, by simply responding NFS4_OK in the event of
a CLOSE for a deallocated stateid, on the assumption that this case
must be caused by a retransmitted close. When adopting this
approach, it is desirable to at least log an error when returning a
no-error indication in this situation. If the server maintains a
reply-cache mechanism, it can verify the CLOSE is indeed a
retransmission and avoid error logging in most cases.
8.11. Open Upgrade and Downgrade
When an OPEN is done for a file and the lockowner for which the open
is being done already has the file open, the result is to upgrade the
open file status maintained on the server to include the access and
deny bits specified by the new OPEN as well as those for the existing
OPEN. The result is that there is one open file, as far as the
protocol is concerned, and it includes the union of the access and
deny bits for all of the OPEN requests completed. Only a single
CLOSE will be done to reset the effects of both OPENs. Note that the
client, when issuing the OPEN, may not know that the same file is in
fact being opened. The above only applies if both OPENs result in
the OPENed object being designated by the same filehandle.
When the server chooses to export multiple filehandles corresponding
to the same file object and returns different filehandles on two
different OPENs of the same file object, the server MUST NOT "OR"
together the access and deny bits and coalesce the two open files.
Instead the server must maintain separate OPENs with separate
stateids and will require separate CLOSEs to free them.
When multiple open files on the client are merged into a single open
file object on the server, the close of one of the open files (on the
client) may necessitate change of the access and deny status of the
open file on the server. This is because the union of the access and
deny bits for the remaining opens may be smaller (i.e., a proper
subset) than previously. The OPEN_DOWNGRADE operation is used to
make the necessary change and the client should use it to update the
server so that share reservation requests by other clients are
handled properly.
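The following C fragment sketches the client-side recomputation that
drives OPEN_DOWNGRADE; the structure and the open_downgrade helper
are hypothetical, while the OPEN4_SHARE_* constants are those defined
in Section 8.9.

   /* One remaining client-side open of the file. */
   struct client_open {
           unsigned int share_access;  /* OPEN4_SHARE_ACCESS_* bits */
           unsigned int share_deny;    /* OPEN4_SHARE_DENY_* bits   */
   };

   extern int open_downgrade(unsigned int access, unsigned int deny);

   /* After closing one of several opens merged into a single open
      file on the server, shrink the server's access/deny state to
      the union of the remaining opens if that union is now smaller. */
   static int
   downgrade_after_close(const struct client_open *remaining, int n,
                         unsigned int cur_access, unsigned int cur_deny)
   {
           unsigned int access = 0, deny = 0;
           int i;

           for (i = 0; i < n; i++) {
                   access |= remaining[i].share_access;
                   deny   |= remaining[i].share_deny;
           }
           if (access != cur_access || deny != cur_deny)
                   return open_downgrade(access, deny);
           return 0;
   }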
8.12. Short and Long Leases
When determining the time period for the server lease, the usual
lease tradeoffs apply. Short leases are good for fast server
recovery at a cost of increased RENEW or READ (with zero length)
requests. Longer leases are certainly kinder and gentler to servers
trying to handle very large numbers of clients. The number of RENEW
requests drops in inverse proportion to the lease time. The
disadvantages of long leases are slower recovery after server failure
(the server must wait for the leases to expire and the grace period
to elapse before granting new lock requests) and increased file
contention (if a client fails to transmit an unlock request, the
server must wait for lease expiration before granting new locks).
Long leases are usable if the server is able to store lease state in
non-volatile memory. Upon recovery, the server can reconstruct the
lease state from its non-volatile memory and continue operation with
its clients and therefore long leases would not be an issue.
8.13. Clocks, Propagation Delay, and Calculating Lease Expiration
To avoid the need for synchronized clocks, lease times are granted by
the server as a time delta. However, there is a requirement that the
client and server clocks do not drift excessively over the duration
of the lock. There is also the issue of propagation delay across the
network which could easily be several hundred milliseconds as well as
the possibility that requests will be lost and need to be
retransmitted.
To take propagation delay into account, the client should subtract it
from lease times (e.g., if the client estimates the one-way
propagation delay as 200 msec, then it can assume that the lease is
already 200 msec old when it gets it). In addition, it will take
another 200 msec to get a response back to the server. So the client
must send a lock renewal or write data back to the server 400 msec
before the lease would expire.
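The arithmetic above can be summarized by the following C sketch,
which computes the latest time at which the client should send a
renewal, assuming an estimate of the one-way propagation delay (names
are illustrative only).

   #include <time.h>

   /* Latest safe renewal time: the lease is treated as already one
      one-way delay old when granted, and another one-way delay is
      needed for the renewal to reach the server.  The margin is
      rounded up to whole seconds. */
   static time_t
   renewal_deadline(time_t lease_granted_at, unsigned int lease_time_s,
                    unsigned int one_way_delay_ms)
   {
           time_t margin_s =
               ((time_t)(2 * one_way_delay_ms) + 999) / 1000;

           return lease_granted_at + (time_t)lease_time_s - margin_s;
   }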
The server's lease period configuration should take into account the
network distance of the clients that will be accessing the server's
resources. It is expected that the lease period will take into
account the network propagation delays and other network delay
factors for the client population. Since the protocol does not allow
for an automatic method to determine an appropriate lease period, the
server's administrator may have to tune the lease period.
8.14. Migration, Replication and State
When responsibility for handling a given file system is transferred
to a new server (migration) or the client chooses to use an alternate
server (e.g., in response to server unresponsiveness) in the context
of file system replication, the appropriate handling of state shared
between the client and server (i.e., locks, leases, stateids, and
clientids) is as described below. The handling differs between
migration and replication. For related discussion of file server
state and recovery of such state, see the sections under "File
Locking and Share Reservations".
If a server replica or a server immigrating a filesystem agrees to, or
is expected to, accept opaque values from the client that originated
from another server, then it is a wise implementation practice for
the servers to encode the "opaque" values in network byte order.
This way, servers acting as replicas or immigrating filesystems will
be able to parse values like stateids, directory cookies,
filehandles, etc. even if their native byte order is different from
other servers cooperating in the replication and migration of the
filesystem.
8.14.1. Migration and State
In the case of migration, the servers involved in the migration of a
filesystem SHOULD transfer all server state from the original to the
new server. This must be done in a way that is transparent to the
client. This state transfer will ease the client's transition when a
filesystem migration occurs. If the servers are successful in
transferring all state, the client will continue to use stateids
assigned by the original server. Therefore the new server must
recognize these stateids as valid. This holds true for the clientid
as well. Since responsibility for an entire filesystem is
transferred with a migration event, there is no possibility that
conflicts will arise on the new server as a result of the transfer of
locks.
As part of the transfer of information between servers, leases would
be transferred as well. The leases being transferred to the new
server will typically have a different expiration time from those for
the same client, previously on the old server. To maintain the
property that all leases on a given server for a given client expire
at the same time, the server should advance the expiration time to
the later of the leases being transferred or the leases already
present. This allows the client to maintain lease renewal of both
classes without special effort.
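A minimal C sketch of this lease-merging rule follows; it simply
takes the later of the transferred and existing expiration times
(names are illustrative).

   #include <time.h>

   /* Expiration used for all of the client's leases on the
      destination server after migration. */
   static time_t
   merged_lease_expiration(time_t transferred_expiry,
                           time_t existing_expiry)
   {
           return (transferred_expiry > existing_expiry) ?
               transferred_expiry : existing_expiry;
   }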
The servers may choose not to transfer the state information upon
migration. However, this choice is discouraged. In this case, when
the client presents state information from the original server (e.g.,
in a RENEW op or a READ op of zero length), the client must be
prepared to receive either NFS4ERR_STALE_CLIENTID or
NFS4ERR_STALE_STATEID from the new server. The client should then
recover its state information as it normally would in response to a
server failure. The new server must take care to allow for the
recovery of state information as it would in the event of server
restart.
A client SHOULD re-establish new callback information with the new
server as soon as possible, according to the sequences described in
Section 14.35 and Section 14.36. This ensures that server
operations are not blocked by the inability to recall delegations.
8.14.2. Replication and State
Since client switch-over in the case of replication is not under
server control, the handling of state is different. In this case,
leases, stateids and clientids do not have validity across a
transition from one server to another. The client must re-establish
its locks on the new server. This can be compared to the re-
establishment of locks by means of reclaim-type requests after a
server reboot. The difference is that the server has no provision to
distinguish requests reclaiming locks from those obtaining new locks
or to defer the latter. Thus, a client re-establishing a lock on the
new server (by means of a LOCK or OPEN request) may have the
request denied due to a conflicting lock. Since replication is
intended for read-only use of filesystems, such denial of locks
should not pose large difficulties in practice. When an attempt to
re-establish a lock on a new server is denied, the client should
treat the situation as if its original lock had been revoked.
8.14.3. Notification of Migrated Lease
In the case of lease renewal, the client may not be submitting
requests for a filesystem that has been migrated to another server.
This can occur because of the implicit lease renewal mechanism. The
client renews leases for all filesystems when submitting a request to
any one filesystem at the server.
In order for the client to schedule renewal of leases that may have
been relocated to the new server, the client must find out about
lease relocation before those leases expire. To accomplish this, all
operations which implicitly renew leases for a client (i.e., OPEN,
CLOSE, READ, WRITE, RENEW, LOCK, LOCKT, LOCKU), will return the error
NFS4ERR_LEASE_MOVED if responsibility for any of the leases to be
renewed has been transferred to a new server. This condition will
continue until the client receives an NFS4ERR_MOVED error and the
server receives the subsequent GETATTR(fs_locations) for an access to
each filesystem for which a lease has been moved to a new server.
When a client receives an NFS4ERR_LEASE_MOVED error, it should
perform an operation on each filesystem associated with the server in
question. When the client receives an NFS4ERR_MOVED error, the
client can follow the normal process to obtain the new server
information (through the fs_locations attribute) and perform renewal
of those leases on the new server. If the server has not had state
transferred to it transparently, the client will receive either
NFS4ERR_STALE_CLIENTID or NFS4ERR_STALE_STATEID from the new server,
as described above, and the client can then recover state information
as it does in the event of server failure.
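The following C fragment sketches the client's reaction to
NFS4ERR_LEASE_MOVED described above; the helper functions are
hypothetical placeholders for an operation on a filesystem and for
the fs_locations-based recovery.

   /* Hypothetical helpers: perform some operation on a filesystem
      and report whether it returned NFS4ERR_MOVED; recover a
      migrated filesystem by fetching fs_locations and renewing
      leases at the new server. */
   extern int probe_filesystem(const char *fs_root);
   extern int status_is_moved(int nfs_status);
   extern int recover_moved_filesystem(const char *fs_root);

   /* Called when any operation returns NFS4ERR_LEASE_MOVED. */
   static void
   handle_lease_moved(const char **fs_roots, int fs_count)
   {
           int i;

           for (i = 0; i < fs_count; i++) {
                   if (status_is_moved(probe_filesystem(fs_roots[i])))
                           (void) recover_moved_filesystem(fs_roots[i]);
           }
   }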
8.14.4. Migration and the Lease_time Attribute
In order that the client may appropriately manage its leases in the
case of migration, the destination server must establish proper
values for the lease_time attribute.
When state is transferred transparently, that state should include
the correct value of the lease_time attribute. The lease_time
attribute on the destination server must never be less than that on
the source since this would result in premature expiration of leases
granted by the source server. Upon migration in which state is
transferred transparently, the client is under no obligation to re-
fetch the lease_time attribute and may continue to use the value
previously fetched (on the source server).
If state has not been transferred transparently (i.e., the client
sees a real or simulated server reboot), the client should fetch the
value of lease_time on the new (i.e., destination) server, and use it
for subsequent locking requests. However, the server must respect a
grace period at least as long as the lease_time on the source server,
in order to ensure that clients have ample time to reclaim their
locks before potentially conflicting non-reclaimed locks are granted.
The means by which the new server obtains the value of lease_time on
the old server is left to the server implementations. It is not
specified by the NFS version 4 protocol.
9. Client-Side Caching
Client-side caching of data, of file attributes, and of file names is
essential to providing good performance with the NFS protocol.
Providing distributed cache coherence is a difficult problem and
previous versions of the NFS protocol have not attempted it.
Instead, several NFS client implementation techniques have been used
to reduce the problems that a lack of coherence poses for users.
These techniques have not been clearly defined by earlier protocol
specifications and it is often unclear what is valid or invalid
client behavior.
The NFS version 4 protocol uses many techniques similar to those that
have been used in previous protocol versions. The NFS version 4
protocol does not provide distributed cache coherence. However, it
defines a more limited set of caching guarantees to allow locks and
share reservations to be used without destructive interference from
client side caching.
In addition, the NFS version 4 protocol introduces a delegation
mechanism which allows many decisions normally made by the server to
be made locally by clients. This mechanism provides efficient
support of the common cases where sharing is infrequent or where
sharing is read-only.
9.1. Performance Challenges for Client-Side Caching
Caching techniques used in previous versions of the NFS protocol have
been successful in providing good performance. However, several
scalability challenges can arise when those techniques are used with
very large numbers of clients. This is particularly true when
clients are geographically distributed, which classically increases
the latency for cache revalidation requests.
The previous versions of the NFS protocol repeat their file data
cache validation requests at the time the file is opened. This
behavior can have serious performance drawbacks. A common case is
one in which a file is only accessed by a single client. Therefore,
sharing is infrequent.
In this case, repeated reference to the server to find that no
conflicts exist is expensive. A better option with regard to
performance is to allow a client that repeatedly opens a file to do
so without reference to the server. This is done until potentially
conflicting operations from another client actually occur.
A similar situation arises in connection with file locking. Sending
file lock and unlock requests to the server as well as the read and
write requests necessary to make data caching consistent with the
locking semantics (see Section 9.3.2 "Data Caching and File Locking")
can severely limit performance. When locking is used to provide
protection against infrequent conflicts, a large penalty is incurred.
This penalty may discourage the use of file locking by applications.
The NFS version 4 protocol provides more aggressive caching
strategies with the following design goals:
o Compatibility with a large range of server semantics.
o Provide the same caching benefits as previous versions of the NFS
protocol when unable to provide the more aggressive model.
o Requirements for aggressive caching are organized so that a large
portion of the benefit can be obtained even when not all of the
requirements can be met.
The appropriate requirements for the server are discussed in later
sections in which specific forms of caching are covered (see
Section 9.4, "Open Delegation").
9.2. Delegation and Callbacks
Recallable delegation of server responsibilities for a file to a
client improves performance by avoiding repeated requests to the
server in the absence of inter-client conflict. With the use of a
"callback" RPC from server to client, a server recalls delegated
responsibilities when another client engages in sharing of a
delegated file.
A delegation is passed from the server to the client, specifying the
object of the delegation and the type of delegation. There are
different types of delegations but each type contains a stateid to be
used to represent the delegation when performing operations that
depend on the delegation. This stateid is similar to those
associated with locks and share reservations but differs in that the
stateid for a delegation is associated with a clientid and may be
used on behalf of all the open_owners for the given client. A
delegation is made to the client as a whole and not to any specific
process or thread of control within it.
Because callback RPCs may not work in all environments (due to
firewalls, for example), correct protocol operation does not depend
on them. Preliminary testing of callback functionality by means of a
CB_NULL procedure determines whether callbacks can be supported. The
CB_NULL procedure checks the continuity of the callback path. A
server makes a preliminary assessment of callback availability to a
given client and avoids delegating responsibilities until it has
determined that callbacks are supported. Because the granting of a
delegation is always conditional upon the absence of conflicting
access, clients must not assume that a delegation will be granted and
they must always be prepared for OPENs to be processed without any
delegations being granted.
Once granted, a delegation behaves in most ways like a lock. There
is an associated lease that is subject to renewal together with all
of the other leases held by that client.
Unlike locks, an operation by a second client to a delegated file
will cause the server to recall a delegation through a callback.
On recall, the client holding the delegation must flush modified
state (such as modified data) to the server and return the
delegation. The conflicting request will not receive a response
until the recall is complete. The recall is considered complete when
the client returns the delegation or the server times out on the
recall and revokes the delegation as a result of the timeout.
Following the resolution of the recall, the server has the
information necessary to grant or deny the second client's request.
At the time the client receives a delegation recall, it may have
substantial state that needs to be flushed to the server. Therefore,
the server should allow sufficient time for the delegation to be
returned since it may involve numerous RPCs to the server. If the
server is able to determine that the client is diligently flushing
state to the server as a result of the recall, the server may extend
the usual time allowed for a recall. However, the time allowed for
recall completion should not be unbounded.
An example of this is when responsibility to mediate opens on a given
file is delegated to a client (see Section 9.4 "Open Delegation").
The server will not know what opens are in effect on the client.
Without this knowledge the server will be unable to determine if the
access and deny state for the file allows any particular open until
the delegation for the file has been returned.
A client failure or a network partition can result in failure to
respond to a recall callback. In this case, the server will revoke
the delegation which in turn will render useless any modified state
still on the client.
9.2.1. Delegation Recovery
There are three situations that delegation recovery must deal with:
o Client reboot or restart
o Server reboot or restart
o Network partition (full or callback-only)
In the event the client reboots or restarts, the failure to renew
leases will result in the revocation of record locks and share
reservations. Delegations, however, may be treated a bit
differently.
There will be situations in which delegations will need to be
reestablished after a client reboots or restarts. The reason for
this is the client may have file data stored locally and this data
was associated with the previously held delegations. The client will
need to reestablish the appropriate file state on the server.
To allow for this type of client recovery, the server MAY extend the
period for delegation recovery beyond the typical lease expiration
period. This implies that requests from other clients that conflict
with these delegations will need to wait. Because the normal recall
process may require significant time for the client to flush changed
state to the server, other clients need be prepared for delays that
occur because of a conflicting delegation. This longer interval
would increase the window for clients to reboot and consult stable
storage so that the delegations can be reclaimed. For open
delegations, such delegations are reclaimed using OPEN with a claim
type of CLAIM_DELEGATE_PREV. (See Section 9.5 "Data Caching and
Revocation" and Section 14.18 "Operation 18: OPEN" for discussion of
open delegation and the details of OPEN respectively).
A server MAY support a claim type of CLAIM_DELEGATE_PREV, but if it
does, it MUST NOT remove delegations upon SETCLIENTID_CONFIRM, and
instead MUST, for a period of time no less than that of the value of
the lease_time attribute, maintain the client's delegations to allow
time for the client to issue CLAIM_DELEGATE_PREV requests. The
server that supports CLAIM_DELEGATE_PREV MUST support the DELEGPURGE
operation.
When the server reboots or restarts, delegations are reclaimed (using
the OPEN operation with CLAIM_PREVIOUS) in a similar fashion to
record locks and share reservations. However, there is a slight
semantic difference. In the normal case if the server decides that a
delegation should not be granted, it performs the requested action
(e.g., OPEN) without granting any delegation. For reclaim, the
server grants the delegation but a special designation is applied so
that the client treats the delegation as having been granted but
recalled by the server. Because of this, the client has the duty to
write all modified state to the server and then return the
delegation. This process of handling delegation reclaim reconciles
three principles of the NFS version 4 protocol:
o Upon reclaim, a client reporting resources assigned to it by an
earlier server instance must be granted those resources.
o The server has unquestionable authority to determine whether
delegations are to be granted and, once granted, whether they are
to be continued.
o The use of callbacks is not to be depended upon until the client
has proven its ability to receive them.
When a network partition occurs, delegations are subject to freeing
by the server when the lease renewal period expires. This is similar
to the behavior for locks and share reservations. For delegations,
however, the server may extend the period in which conflicting
requests are held off. Eventually the occurrence of a conflicting
request from another client will cause revocation of the delegation.
A loss of the callback path (e.g., by later network configuration
change) will have the same effect. A recall request will fail and
revocation of the delegation will result.
A client normally finds out about revocation of a delegation when it
uses a stateid associated with a delegation and receives the error
NFS4ERR_EXPIRED. It also may find out about delegation revocation
after a client reboot when it attempts to reclaim a delegation and
receives that same error. Note that in the case of a revoked write
open delegation, there are issues because data may have been modified
by the client whose delegation is revoked and separately by other
clients. See Section 9.5.1 "Revocation Recovery for Write Open
Delegation" for a discussion of such issues. Note also that when
delegations are revoked, information about the revoked delegation
will be written by the server to stable storage (as described in
Section 8.6 "Crash Recovery"). This is done to deal with the case in
which a server reboots after revoking a delegation but before the
client holding the revoked delegation is notified about the
revocation.
9.3. Data Caching
When applications share access to a set of files, they need to be
implemented so as to take account of the possibility of conflicting
access by another application. This is true whether the applications
in question execute on different clients or reside on the same
client.
Share reservations and record locks are the facilities the NFS
version 4 protocol provides to allow applications to coordinate
access by providing mutual exclusion facilities. The NFS version 4
protocol's data caching must be implemented such that it does not
invalidate the assumptions that those using these facilities depend
upon.
9.3.1. Data Caching and OPENs
In order to avoid invalidating the sharing assumptions that
applications rely on, NFS version 4 clients should not provide cached
data to applications or modify it on behalf of an application when it
would not be valid to obtain or modify that same data via a READ or
WRITE operation.
Furthermore, in the absence of open delegation (see Section 9.4 "Open
Delegation") two additional rules apply. Note that these rules are
obeyed in practice by many NFS version 2 and version 3 clients.
o First, cached data present on a client must be revalidated after
doing an OPEN. Revalidating means that the client fetches the
change attribute from the server, compares it with the cached
change attribute, and if different, declares the cached data (as
well as the cached attributes) as invalid. This is to ensure that
the data for the OPENed file is still correctly reflected in the
client's cache. This validation must be done at least when the
client's OPEN operation includes DENY=WRITE or BOTH thus
terminating a period in which other clients may have had the
opportunity to open the file with WRITE access. Clients may
choose to do the revalidation more often (i.e., at OPENs
specifying DENY=NONE) to parallel the NFS version 3 protocol's
practice for the benefit of users assuming this degree of cache
revalidation. Since the change attribute is updated for data and
metadata modifications, some client implementors may be tempted to
use the time_modify attribute rather than change to validate cached
data, so that metadata changes do not spuriously invalidate clean
data. The implementor is cautioned against this approach. The change
attribute is guaranteed to change for each update to the file,
whereas time_modify is guaranteed to change only at the
granularity of the time_delta attribute. Use by the client's data
cache validation logic of time_modify instead of change runs the risk
of the client incorrectly marking stale data as valid. A sketch of
this change-attribute revalidation follows this list.
o Second, modified data must be flushed to the server before closing
a file OPENed for write. This is complementary to the first rule.
If the data is not flushed at CLOSE, the revalidation done after
the client OPENs a file is unable to achieve its purpose. The other
aspect to flushing the data before close is that the data must be
committed to stable storage, at the server, before the CLOSE
operation is requested by the client. In the case of a server
reboot or restart and a CLOSEd file, it may not be possible to
retransmit the data to be written to the file. Hence, this
requirement.
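As referenced in the first rule above, the following C fragment
sketches change-attribute revalidation at OPEN; the helper functions
are hypothetical.

   typedef unsigned long long changeid4;

   /* Hypothetical helpers supplied by the client implementation. */
   extern int  fetch_change_attr(const char *path, changeid4 *out);
   extern void invalidate_cached_data_and_attrs(const char *path);

   /* Revalidate cached data for a file at OPEN time. */
   static void
   revalidate_on_open(const char *path, changeid4 cached_change)
   {
           changeid4 current;

           if (fetch_change_attr(path, &current) != 0 ||
               current != cached_change) {
                   /* The change attribute differs or could not be
                      fetched: declare the cached data and attributes
                      invalid. */
                   invalidate_cached_data_and_attrs(path);
           }
   }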