draft-ietf-nfsv4-minorversion2-07.txt   draft-ietf-nfsv4-minorversion2-08.txt 
NFSv4 T. Haynes NFSv4 T. Haynes
Internet-Draft Editor Internet-Draft Editor
Intended status: Standards Track January 04, 2012 Intended status: Standards Track April 25, 2012
Expires: July 7, 2012 Expires: October 27, 2012
NFS Version 4 Minor Version 2 NFS Version 4 Minor Version 2
draft-ietf-nfsv4-minorversion2-07.txt draft-ietf-nfsv4-minorversion2-08.txt
Abstract Abstract
This Internet-Draft describes NFS version 4 minor version two, This Internet-Draft describes NFS version 4 minor version two,
focusing mainly on the protocol extensions made from NFS version 4 focusing mainly on the protocol extensions made from NFS version 4
minor version 0 and NFS version 4 minor version 1. Major extensions minor version 0 and NFS version 4 minor version 1. Major extensions
introduced in NFS version 4 minor version two include: Server-side introduced in NFS version 4 minor version two include: Server-side
Copy, Space Reservations, and Support for Sparse Files. Copy, Space Reservations, and Support for Sparse Files.
Requirements Language Requirements Language
skipping to change at page 1, line 40 skipping to change at page 1, line 40
Internet-Drafts are working documents of the Internet Engineering Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF). Note that other groups may also distribute Task Force (IETF). Note that other groups may also distribute
working documents as Internet-Drafts. The list of current Internet- working documents as Internet-Drafts. The list of current Internet-
Drafts is at http://datatracker.ietf.org/drafts/current/. Drafts is at http://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress." material or to cite them other than as "work in progress."
This Internet-Draft will expire on July 7, 2012. This Internet-Draft will expire on October 27, 2012.
Copyright Notice Copyright Notice
Copyright (c) 2012 IETF Trust and the persons identified as the Copyright (c) 2012 IETF Trust and the persons identified as the
document authors. All rights reserved. document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents Provisions Relating to IETF Documents
(http://trustee.ietf.org/license-info) in effect on the date of (http://trustee.ietf.org/license-info) in effect on the date of
publication of this document. Please review these documents publication of this document. Please review these documents
skipping to change at page 3, line 7 skipping to change at page 3, line 7
modifications of such material outside the IETF Standards Process. modifications of such material outside the IETF Standards Process.
Without obtaining an adequate license from the person(s) controlling Without obtaining an adequate license from the person(s) controlling
the copyright in such materials, this document may not be modified the copyright in such materials, this document may not be modified
outside the IETF Standards Process, and derivative works of it may outside the IETF Standards Process, and derivative works of it may
not be created outside the IETF Standards Process, except to format not be created outside the IETF Standards Process, except to format
it for publication as an RFC or to translate it into languages other it for publication as an RFC or to translate it into languages other
than English. than English.
Table of Contents Table of Contents
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 5 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.1. The NFS Version 4 Minor Version 2 Protocol . . . . . . . 5 1.1. The NFS Version 4 Minor Version 2 Protocol . . . . . . . 6
1.2. Scope of This Document . . . . . . . . . . . . . . . . . 5 1.2. Scope of This Document . . . . . . . . . . . . . . . . . 6
1.3. NFSv4.2 Goals . . . . . . . . . . . . . . . . . . . . . . 5 1.3. NFSv4.2 Goals . . . . . . . . . . . . . . . . . . . . . . 6
1.4. Overview of NFSv4.2 Features . . . . . . . . . . . . . . 5 1.4. Overview of NFSv4.2 Features . . . . . . . . . . . . . . 6
1.4.1. Sparse Files . . . . . . . . . . . . . . . . . . . . . 5 1.4.1. Sparse Files . . . . . . . . . . . . . . . . . . . . . 6
1.4.2. Application I/O Advise . . . . . . . . . . . . . . . . 6 1.4.2. Application I/O Advise . . . . . . . . . . . . . . . . 7
1.5. Differences from NFSv4.1 . . . . . . . . . . . . . . . . 6 1.5. Differences from NFSv4.1 . . . . . . . . . . . . . . . . 7
2. NFS Server-side Copy . . . . . . . . . . . . . . . . . . . . . 6 2. NFS Server-side Copy . . . . . . . . . . . . . . . . . . . . . 7
2.1. Introduction . . . . . . . . . . . . . . . . . . . . . . 6 2.1. Introduction . . . . . . . . . . . . . . . . . . . . . . 7
2.2. Protocol Overview . . . . . . . . . . . . . . . . . . . . 7 2.2. Protocol Overview . . . . . . . . . . . . . . . . . . . . 8
2.2.1. Intra-Server Copy . . . . . . . . . . . . . . . . . . 8 2.2.1. Intra-Server Copy . . . . . . . . . . . . . . . . . . 9
2.2.2. Inter-Server Copy . . . . . . . . . . . . . . . . . . 9 2.2.2. Inter-Server Copy . . . . . . . . . . . . . . . . . . 11
2.2.3. Server-to-Server Copy Protocol . . . . . . . . . . . . 12 2.2.3. Server-to-Server Copy Protocol . . . . . . . . . . . . 13
2.3. Operations . . . . . . . . . . . . . . . . . . . . . . . 14 2.3. Operations . . . . . . . . . . . . . . . . . . . . . . . 15
2.3.1. netloc4 - Network Locations . . . . . . . . . . . . . 14 2.3.1. netloc4 - Network Locations . . . . . . . . . . . . . 15
2.3.2. Copy Offload Stateids . . . . . . . . . . . . . . . . 15 2.3.2. Copy Offload Stateids . . . . . . . . . . . . . . . . 16
2.4. Security Considerations . . . . . . . . . . . . . . . . . 15 2.4. Security Considerations . . . . . . . . . . . . . . . . . 16
2.4.1. Inter-Server Copy Security . . . . . . . . . . . . . . 15 2.4.1. Inter-Server Copy Security . . . . . . . . . . . . . . 16
3. Sparse Files . . . . . . . . . . . . . . . . . . . . . . . . . 24 3. Sparse Files . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.1. Introduction . . . . . . . . . . . . . . . . . . . . . . 24 3.1. Introduction . . . . . . . . . . . . . . . . . . . . . . 25
3.2. Terminology . . . . . . . . . . . . . . . . . . . . . . . 24 3.2. Terminology . . . . . . . . . . . . . . . . . . . . . . . 25
3.3. Determining the next hole/data . . . . . . . . . . . . . 25 3.3. Determining the next hole/data . . . . . . . . . . . . . 26
4. Space Reservation . . . . . . . . . . . . . . . . . . . . . . 25 4. Space Reservation . . . . . . . . . . . . . . . . . . . . . . 26
4.1. Introduction . . . . . . . . . . . . . . . . . . . . . . 25 4.1. Introduction . . . . . . . . . . . . . . . . . . . . . . 26
5. Support for Application IO Hints . . . . . . . . . . . . . . . 27 5. Support for Application IO Hints . . . . . . . . . . . . . . . 28
5.1. Introduction . . . . . . . . . . . . . . . . . . . . . . 27 5.1. Introduction . . . . . . . . . . . . . . . . . . . . . . 28
5.2. POSIX Requirements . . . . . . . . . . . . . . . . . . . 28 5.2. POSIX Requirements . . . . . . . . . . . . . . . . . . . 29
5.3. Additional Requirements . . . . . . . . . . . . . . . . . 29 5.3. Additional Requirements . . . . . . . . . . . . . . . . . 30
5.4. Security Considerations . . . . . . . . . . . . . . . . . 30 5.4. Security Considerations . . . . . . . . . . . . . . . . . 31
5.5. IANA Considerations . . . . . . . . . . . . . . . . . . . 30 5.5. IANA Considerations . . . . . . . . . . . . . . . . . . . 31
6. Application Data Block Support . . . . . . . . . . . . . . . . 30 6. Application Data Block Support . . . . . . . . . . . . . . . . 31
6.1. Generic Framework . . . . . . . . . . . . . . . . . . . . 31 6.1. Generic Framework . . . . . . . . . . . . . . . . . . . . 32
6.1.1. Data Block Representation . . . . . . . . . . . . . . 31 6.1.1. Data Block Representation . . . . . . . . . . . . . . 32
6.1.2. Data Content . . . . . . . . . . . . . . . . . . . . . 32 6.1.2. Data Content . . . . . . . . . . . . . . . . . . . . . 33
6.2. pNFS Considerations . . . . . . . . . . . . . . . . . . . 32 6.2. pNFS Considerations . . . . . . . . . . . . . . . . . . . 33
6.3. An Example of Detecting Corruption . . . . . . . . . . . 33 6.3. An Example of Detecting Corruption . . . . . . . . . . . 34
6.4. Example of READ_PLUS . . . . . . . . . . . . . . . . . . 34 6.4. Example of READ_PLUS . . . . . . . . . . . . . . . . . . 35
6.5. Zero Filled Holes . . . . . . . . . . . . . . . . . . . . 35 6.5. Zero Filled Holes . . . . . . . . . . . . . . . . . . . . 36
7. Labeled NFS . . . . . . . . . . . . . . . . . . . . . . . . . 35 7. Labeled NFS . . . . . . . . . . . . . . . . . . . . . . . . . 36
7.1. Introduction . . . . . . . . . . . . . . . . . . . . . . 35 7.1. Introduction . . . . . . . . . . . . . . . . . . . . . . 36
7.2. Definitions . . . . . . . . . . . . . . . . . . . . . . . 36 7.2. Definitions . . . . . . . . . . . . . . . . . . . . . . . 37
7.3. MAC Security Attribute . . . . . . . . . . . . . . . . . 37 7.3. MAC Security Attribute . . . . . . . . . . . . . . . . . 37
7.3.1. Interpreting FATTR4_SEC_LABEL . . . . . . . . . . . . 37 7.3.1. Interpreting FATTR4_SEC_LABEL . . . . . . . . . . . . 38
7.3.2. Delegations . . . . . . . . . . . . . . . . . . . . . 38 7.3.2. Delegations . . . . . . . . . . . . . . . . . . . . . 39
7.3.3. Permission Checking . . . . . . . . . . . . . . . . . 38 7.3.3. Permission Checking . . . . . . . . . . . . . . . . . 39
7.3.4. Object Creation . . . . . . . . . . . . . . . . . . . 39 7.3.4. Object Creation . . . . . . . . . . . . . . . . . . . 40
7.3.5. Existing Objects . . . . . . . . . . . . . . . . . . . 39 7.3.5. Existing Objects . . . . . . . . . . . . . . . . . . . 40
7.3.6. Label Changes . . . . . . . . . . . . . . . . . . . . 39 7.3.6. Label Changes . . . . . . . . . . . . . . . . . . . . 40
7.4. pNFS Considerations . . . . . . . . . . . . . . . . . . . 40 7.4. pNFS Considerations . . . . . . . . . . . . . . . . . . . 41
7.5. Discovery of Server LNFS Support . . . . . . . . . . . . 40 7.5. Discovery of Server LNFS Support . . . . . . . . . . . . 41
7.6. MAC Security NFS Modes of Operation . . . . . . . . . . . 41 7.6. MAC Security NFS Modes of Operation . . . . . . . . . . . 42
7.6.1. Full Mode . . . . . . . . . . . . . . . . . . . . . . 41 7.6.1. Full Mode . . . . . . . . . . . . . . . . . . . . . . 42
7.6.2. Smart Client Mode . . . . . . . . . . . . . . . . . . 42 7.6.2. Guest Mode . . . . . . . . . . . . . . . . . . . . . . 43
7.6.3. Smart Server Mode . . . . . . . . . . . . . . . . . . 43
7.7. Security Considerations . . . . . . . . . . . . . . . . . 44 7.7. Security Considerations . . . . . . . . . . . . . . . . . 44
8. Sharing change attribute implementation details with NFSv4 8. Sharing change attribute implementation details with NFSv4
clients . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 clients . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
8.1. Introduction . . . . . . . . . . . . . . . . . . . . . . 44 8.1. Introduction . . . . . . . . . . . . . . . . . . . . . . 44
8.2. Definition of the 'change_attr_type' per-file system 8.2. Definition of the 'change_attr_type' per-file system
attribute . . . . . . . . . . . . . . . . . . . . . . . . 45 attribute . . . . . . . . . . . . . . . . . . . . . . . . 45
9. Security Considerations . . . . . . . . . . . . . . . . . . . 46 9. Security Considerations . . . . . . . . . . . . . . . . . . . 46
10. File Attributes . . . . . . . . . . . . . . . . . . . . . . . 46 10. Error Values . . . . . . . . . . . . . . . . . . . . . . . . . 46
10.1. Attribute Definitions . . . . . . . . . . . . . . . . . . 46 10.1. Error Definitions . . . . . . . . . . . . . . . . . . . . 46
11. Operations: REQUIRED, RECOMMENDED, or OPTIONAL . . . . . . . . 47 10.1.1. General Errors . . . . . . . . . . . . . . . . . . . . 47
12. NFSv4.2 Operations . . . . . . . . . . . . . . . . . . . . . . 50 10.1.2. Server to Server Copy Errors . . . . . . . . . . . . . 47
12.1. Operation 59: COPY - Initiate a server-side copy . . . . 50 10.1.3. Labeled NFS Errors . . . . . . . . . . . . . . . . . . 47
12.2. Operation 60: COPY_ABORT - Cancel a server-side copy . . 58 11. File Attributes . . . . . . . . . . . . . . . . . . . . . . . 48
12.3. Operation 61: COPY_NOTIFY - Notify a source server of 11.1. Attribute Definitions . . . . . . . . . . . . . . . . . . 48
a future copy . . . . . . . . . . . . . . . . . . . . . . 59 12. Operations: REQUIRED, RECOMMENDED, or OPTIONAL . . . . . . . . 48
12.4. Operation 62: COPY_REVOKE - Revoke a destination 13. NFSv4.2 Operations . . . . . . . . . . . . . . . . . . . . . . 52
server's copy privileges . . . . . . . . . . . . . . . . 62 13.1. Operation 59: COPY - Initiate a server-side copy . . . . 52
12.5. Operation 63: COPY_STATUS - Poll for status of a 13.2. Operation 60: COPY_ABORT - Cancel a server-side copy . . 59
server-side copy . . . . . . . . . . . . . . . . . . . . 63 13.3. Operation 61: COPY_NOTIFY - Notify a source server of
12.6. Modification to Operation 42: EXCHANGE_ID - a future copy . . . . . . . . . . . . . . . . . . . . . . 61
Instantiate Client ID . . . . . . . . . . . . . . . . . . 64 13.4. Operation 62: COPY_REVOKE - Revoke a destination
12.7. Operation 64: INITIALIZE . . . . . . . . . . . . . . . . 65 server's copy privileges . . . . . . . . . . . . . . . . 63
12.8. Operation 67: IO_ADVISE - Application I/O access 13.5. Operation 63: COPY_STATUS - Poll for status of a
pattern hints . . . . . . . . . . . . . . . . . . . . . . 69 server-side copy . . . . . . . . . . . . . . . . . . . . 64
12.9. Changes to Operation 51: LAYOUTRETURN . . . . . . . . . . 75 13.6. Modification to Operation 42: EXCHANGE_ID -
12.10. Operation 65: READ_PLUS . . . . . . . . . . . . . . . . . 78 Instantiate Client ID . . . . . . . . . . . . . . . . . . 65
12.11. Operation 66: SEEK . . . . . . . . . . . . . . . . . . . 84 13.7. Operation 64: INITIALIZE . . . . . . . . . . . . . . . . 66
13. NFSv4.2 Callback Operations . . . . . . . . . . . . . . . . . 86 13.8. Operation 67: IO_ADVISE - Application I/O access
13.1. Procedure 16: CB_ATTR_CHANGED - Notify Client that pattern hints . . . . . . . . . . . . . . . . . . . . . . 70
13.9. Changes to Operation 51: LAYOUTRETURN . . . . . . . . . . 76
13.10. Operation 65: READ_PLUS . . . . . . . . . . . . . . . . . 79
13.11. Operation 66: SEEK . . . . . . . . . . . . . . . . . . . 85
14. NFSv4.2 Callback Operations . . . . . . . . . . . . . . . . . 86
14.1. Procedure 16: CB_ATTR_CHANGED - Notify Client that
the File's Attributes Changed . . . . . . . . . . . . . . 86 the File's Attributes Changed . . . . . . . . . . . . . . 86
13.2. Operation 15: CB_COPY - Report results of a 14.2. Operation 15: CB_COPY - Report results of a
server-side copy . . . . . . . . . . . . . . . . . . . . 86 server-side copy . . . . . . . . . . . . . . . . . . . . 87
14. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 88 15. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 88
15. References . . . . . . . . . . . . . . . . . . . . . . . . . . 88 16. References . . . . . . . . . . . . . . . . . . . . . . . . . . 88
15.1. Normative References . . . . . . . . . . . . . . . . . . 88 16.1. Normative References . . . . . . . . . . . . . . . . . . 88
15.2. Informative References . . . . . . . . . . . . . . . . . 89 16.2. Informative References . . . . . . . . . . . . . . . . . 89
Appendix A. Acknowledgments . . . . . . . . . . . . . . . . . . . 91 Appendix A. Acknowledgments . . . . . . . . . . . . . . . . . . . 90
Appendix B. RFC Editor Notes . . . . . . . . . . . . . . . . . . 91 Appendix B. RFC Editor Notes . . . . . . . . . . . . . . . . . . 91
Author's Address . . . . . . . . . . . . . . . . . . . . . . . . . 91 Author's Address . . . . . . . . . . . . . . . . . . . . . . . . . 91
1. Introduction 1. Introduction
1.1. The NFS Version 4 Minor Version 2 Protocol 1.1. The NFS Version 4 Minor Version 2 Protocol
The NFS version 4 minor version 2 (NFSv4.2) protocol is the third The NFS version 4 minor version 2 (NFSv4.2) protocol is the third
minor version of the NFS version 4 (NFSv4) protocol. The first minor minor version of the NFS version 4 (NFSv4) protocol. The first minor
version, NFSv4.0, is described in [11] and the second minor version, version, NFSv4.0, is described in [10] and the second minor version,
NFSv4.1, is described in [2]. It follows the guidelines for minor NFSv4.1, is described in [2]. It follows the guidelines for minor
versioning that are listed in Section 11 of [11]. versioning that are listed in Section 11 of [10].
As a minor version, NFSv4.2 is consistent with the overall goals for As a minor version, NFSv4.2 is consistent with the overall goals for
NFSv4, but extends the protocol so as to better meet those goals, NFSv4, but extends the protocol so as to better meet those goals,
based on experiences with NFSv4.1. In addition, NFSv4.2 has adopted based on experiences with NFSv4.1. In addition, NFSv4.2 has adopted
some additional goals, which motivate some of the major extensions in some additional goals, which motivate some of the major extensions in
NFSv4.2. NFSv4.2.
1.2. Scope of This Document 1.2. Scope of This Document
This document describes the NFSv4.2 protocol. With respect to This document describes the NFSv4.2 protocol. With respect to
skipping to change at page 6, line 18 skipping to change at page 7, line 18
to communicate expected I/O behavior to the server. By communicating to communicate expected I/O behavior to the server. By communicating
future I/O behavior such as whether a file will be accessed future I/O behavior such as whether a file will be accessed
sequentially or randomly, and whether a file will or will not be sequentially or randomly, and whether a file will or will not be
accessed in the near future, servers can optimize future I/O requests accessed in the near future, servers can optimize future I/O requests
for a file by, for example, prefetching or evicting data. This for a file by, for example, prefetching or evicting data. This
operation can be used to support the posix_fadvise function as well operation can be used to support the posix_fadvise function as well
as other applications such as databases and video editors. as other applications such as databases and video editors.
1.5. Differences from NFSv4.1 1.5. Differences from NFSv4.1
[[Comment.3: This needs fleshing out! --TH]] In NFSv4.1, the only way to introduce new variants of an operation
was to introduce a new operation. I.e., READ becomes either READ2 or
READ_PLUS. With the use of discriminated unions as parameters to
such functions in NFSv4.2, it is possible to add a new arm in a
subsequent minor version. And it is also possible to move such an
operation from OPTIONAL/RECOMMENDED to REQUIRED. Forcing an
implementation to adopt each arm of a discriminated union at such a
time does not meet the spirit of the minor versioning rules. As
such, new arms of a discriminated union MUST follow the same
guidelines for minor versioning as operations in NFSv4.1 - i.e., they
may not be made REQUIRED. To support this, a new error code,
NFS4ERR_UNION_NOTSUPP, is introduced which allows the server to
communicate to the client that the operation is supported, but the
specific arm of the discriminated union is not.
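The distinction between an unsupported operation and an unsupported arm can be sketched as follows. This is a minimal illustration, not text from either draft: the table SUPPORTED_ARMS, the arm names, and the function check_union_arm are hypothetical, and the use of NFS4ERR_NOTSUPP for an entirely unsupported operation is an assumption of this sketch.

```python
from enum import Enum, auto


class Status(Enum):
    NFS4_OK = auto()
    NFS4ERR_NOTSUPP = auto()        # assumed reply when the whole operation is unsupported
    NFS4ERR_UNION_NOTSUPP = auto()  # operation supported, this arm is not


# Hypothetical per-server table: operation name -> union arms it implements.
SUPPORTED_ARMS = {
    "READ_PLUS": {"DATA", "HOLE"},
}


def check_union_arm(op, arm):
    """Return the status a server might send for (operation, union arm)."""
    arms = SUPPORTED_ARMS.get(op)
    if arms is None:
        # The server does not implement the operation at all.
        return Status.NFS4ERR_NOTSUPP
    if arm not in arms:
        # The operation exists, but this arm of its union does not.
        return Status.NFS4ERR_UNION_NOTSUPP
    return Status.NFS4_OK
```

A client that receives NFS4ERR_UNION_NOTSUPP can keep using the operation with a different arm, which it could not conclude from a plain "not supported" reply.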
2. NFS Server-side Copy 2. NFS Server-side Copy
2.1. Introduction 2.1. Introduction
This section describes a server-side copy feature for the NFS This section describes a server-side copy feature for the NFS
protocol. protocol.
The server-side copy feature provides a mechanism for the NFS client The server-side copy feature provides a mechanism for the NFS client
to perform a file copy on the server without the data being to perform a file copy on the server without the data being
skipping to change at page 7, line 28 skipping to change at page 8, line 37
server are the same server. Therefore in the context of an intra- server are the same server. Therefore in the context of an intra-
server copy, the terms source server and destination server refer to server copy, the terms source server and destination server refer to
the single server performing the copy. the single server performing the copy.
The operations described below are designed to copy files. Other The operations described below are designed to copy files. Other
file system objects can be copied by building on these operations or file system objects can be copied by building on these operations or
using other techniques. For example, if the user wishes to copy a using other techniques. For example, if the user wishes to copy a
directory, the client can synthesize a directory copy by first directory, the client can synthesize a directory copy by first
creating the destination directory and then copying the source creating the destination directory and then copying the source
directory's files to the new destination directory. If the user directory's files to the new destination directory. If the user
wishes to copy a namespace junction [12] [13], the client can use the wishes to copy a namespace junction [11] [12], the client can use the
ONC RPC Federated Filesystem protocol [13] to perform the copy. ONC RPC Federated Filesystem protocol [12] to perform the copy.
Specifically the client can determine the source junction's Specifically the client can determine the source junction's
attributes using the FEDFS_LOOKUP_FSN procedure and create a attributes using the FEDFS_LOOKUP_FSN procedure and create a
duplicate junction using the FEDFS_CREATE_JUNCTION procedure. duplicate junction using the FEDFS_CREATE_JUNCTION procedure.
For the inter-server copy protocol, the operations are defined to be For the inter-server copy protocol, the operations are defined to be
compatible with a server-to-server copy protocol in which the compatible with a server-to-server copy protocol in which the
destination server reads the file data from the source server. This destination server reads the file data from the source server. This
model in which the file data is pulled from the source by the model in which the file data is pulled from the source by the
destination has a number of advantages over a model in which the destination has a number of advantages over a model in which the
source pushes the file data to the destination. The advantages of source pushes the file data to the destination. The advantages of
skipping to change at page 13, line 28 skipping to change at page 14, line 28
of the source file to the destination file by replicating the file of the source file to the destination file by replicating the file
system formats at the block level. Another possibility is that the system formats at the block level. Another possibility is that the
source and destination might be two nodes sharing a common storage source and destination might be two nodes sharing a common storage
area network, and thus there is no need to copy any data at all, and area network, and thus there is no need to copy any data at all, and
instead ownership of the file and its contents might simply be re- instead ownership of the file and its contents might simply be re-
assigned to the destination. To allow for these possibilities, the assigned to the destination. To allow for these possibilities, the
destination server is allowed to use a server-to-server copy protocol destination server is allowed to use a server-to-server copy protocol
of its choice. of its choice.
In a heterogeneous environment, using a protocol other than NFSv4.x In a heterogeneous environment, using a protocol other than NFSv4.x
(e.g., HTTP [14] or FTP [15]) presents some challenges. In (e.g., HTTP [13] or FTP [14]) presents some challenges. In
particular, the destination server is presented with the challenge of particular, the destination server is presented with the challenge of
accessing the source file given only an NFSv4.x filehandle. accessing the source file given only an NFSv4.x filehandle.
One option for protocols that identify source files with path names One option for protocols that identify source files with path names
is to use an ASCII hexadecimal representation of the source is to use an ASCII hexadecimal representation of the source
filehandle as the file name. filehandle as the file name.
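As a rough sketch of this option, an opaque filehandle can be mapped to a path-safe name and back. The function names are hypothetical, and the drafts do not specify the letter case of the hexadecimal digits; lowercase is an assumption here.

```python
def filehandle_to_name(filehandle):
    """Encode an opaque NFSv4.x filehandle as an ASCII hexadecimal file name."""
    return filehandle.hex()


def name_to_filehandle(name):
    """Recover the opaque filehandle from its ASCII hexadecimal file name."""
    return bytes.fromhex(name)
```

Because every byte becomes two characters from 0-9 and a-f, the resulting name contains no path separators or other characters that a path-based protocol might treat specially.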
Another option for the source server is to use URLs to direct the Another option for the source server is to use URLs to direct the
destination server to a specialized service. For example, the destination server to a specialized service. For example, the
response to COPY_NOTIFY could include the URL response to COPY_NOTIFY could include the URL
skipping to change at page 15, line 19 skipping to change at page 16, line 19
A server may perform a copy offload operation asynchronously. An A server may perform a copy offload operation asynchronously. An
asynchronous copy is tracked using a copy offload stateid. Copy asynchronous copy is tracked using a copy offload stateid. Copy
offload stateids are included in the COPY, COPY_ABORT, COPY_STATUS, offload stateids are included in the COPY, COPY_ABORT, COPY_STATUS,
and CB_COPY operations. and CB_COPY operations.
Section 8.2.4 of [2] specifies that stateids are valid until either Section 8.2.4 of [2] specifies that stateids are valid until either
(A) the client or server restarts or (B) the client returns the (A) the client or server restarts or (B) the client returns the
resource. resource.
A copy offload stateid will be valid until either (A) the client or A copy offload stateid will be valid until either (A) the client or
server restart or (B) the client returns the resource by issuing a server restarts or (B) the client returns the resource by issuing a
COPY_ABORT operation or the client replies to a CB_COPY operation. COPY_ABORT operation or the client replies to a CB_COPY operation.
A copy offload stateid's seqid MUST NOT be 0 (zero). In the context A copy offload stateid's seqid MUST NOT be 0 (zero). In the context
of a copy offload operation, it is ambiguous to indicate the most of a copy offload operation, it is ambiguous to indicate the most
recent copy offload operation using a stateid with seqid of 0 (zero). recent copy offload operation using a stateid with seqid of 0 (zero).
Therefore a copy offload stateid with seqid of 0 (zero) MUST be Therefore a copy offload stateid with seqid of 0 (zero) MUST be
considered invalid. considered invalid.
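The seqid rule can be expressed as a small validity check. This is only a sketch: the Stateid class and the function name are hypothetical, and the field layout (a 32-bit seqid plus a 12-byte opaque field) is taken from the general NFSv4 stateid definition rather than from the text above.

```python
from dataclasses import dataclass

OTHER_SIZE = 12  # size of the opaque "other" field of an NFSv4 stateid


@dataclass(frozen=True)
class Stateid:
    seqid: int
    other: bytes


def is_valid_copy_offload_stateid(sid):
    """Reject malformed stateids and any copy offload stateid with seqid 0."""
    if len(sid.other) != OTHER_SIZE:
        return False
    # seqid 0 would mean "most recent", which is ambiguous for copy offload.
    return 0 < sid.seqid <= 0xFFFFFFFF
```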
2.4. Security Considerations 2.4. Security Considerations
The security considerations pertaining to NFSv4 [11] apply to this The security considerations pertaining to NFSv4 [10] apply to this
document. document.
The standard security mechanisms provided by NFSv4 [11] may be used to The standard security mechanisms provided by NFSv4 [10] may be used to
secure the protocol described in this document. secure the protocol described in this document.
NFSv4 clients and servers supporting the inter-server copy NFSv4 clients and servers supporting the inter-server copy
operations described in this document are REQUIRED to implement [5], operations described in this document are REQUIRED to implement [5],
including the RPCSEC_GSSv3 privileges copy_from_auth and including the RPCSEC_GSSv3 privileges copy_from_auth and
copy_to_auth. If the server-to-server copy protocol is ONC RPC copy_to_auth. If the server-to-server copy protocol is ONC RPC
based, the servers are also REQUIRED to implement the RPCSEC_GSSv3 based, the servers are also REQUIRED to implement the RPCSEC_GSSv3
privilege copy_confirm_auth. These requirements to implement are not privilege copy_confirm_auth. These requirements to implement are not
requirements to use. NFSv4 clients and servers are RECOMMENDED to requirements to use. NFSv4 clients and servers are RECOMMENDED to
use [5] to secure server-side copy operations. use [5] to secure server-side copy operations.
skipping to change at page 22, line 25 skipping to change at page 23, line 25
2.4.1.2.4. Securing Non ONC RPC Server-to-Server Copy Protocols 2.4.1.2.4. Securing Non ONC RPC Server-to-Server Copy Protocols
If the destination won't be using ONC RPC to copy the data, then the If the destination won't be using ONC RPC to copy the data, then the
source and destination are using an unspecified copy protocol. The source and destination are using an unspecified copy protocol. The
destination could use the shared secret and the NFSv4 user id to destination could use the shared secret and the NFSv4 user id to
prove to the source server that the user principal has authorized the prove to the source server that the user principal has authorized the
copy. copy.
For protocols that authenticate user names with passwords (e.g., HTTP For protocols that authenticate user names with passwords (e.g., HTTP
[14] and FTP [15]), the NFSv4 user id could be used as the user name, [13] and FTP [14]), the NFSv4 user id could be used as the user name,
and an ASCII hexadecimal representation of the RPCSEC_GSSv3 shared and an ASCII hexadecimal representation of the RPCSEC_GSSv3 shared
secret could be used as the user password or as input into non- secret could be used as the user password or as input into non-
password authentication methods like CHAP [16]. password authentication methods like CHAP [15].
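One way to picture the user name and password mapping is sketched below. The function names are hypothetical and the source-side check is an assumption of this sketch; the drafts leave the actual copy protocol and its authentication exchange unspecified.

```python
import hmac


def copy_credentials(nfs_user_id, shared_secret):
    """Destination side: derive (user name, password) from the copy context."""
    return nfs_user_id, shared_secret.hex()


def source_accepts(presented_user, presented_password, expected_user, shared_secret):
    """Source side: compare presented credentials against the expected ones."""
    expected_password = shared_secret.hex()
    # compare_digest avoids leaking how many leading characters matched.
    user_ok = hmac.compare_digest(presented_user.encode(), expected_user.encode())
    pw_ok = hmac.compare_digest(presented_password.encode(), expected_password.encode())
    return user_ok and pw_ok
```

Sending such a password in the clear would expose the shared secret to an eavesdropper, so a deployment would presumably pair this with a protected channel or a challenge-based method.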
2.4.1.3. Inter-Server Copy via ONC RPC but without RPCSEC_GSSv3 2.4.1.3. Inter-Server Copy via ONC RPC but without RPCSEC_GSSv3
ONC RPC security flavors other than RPCSEC_GSSv3 MAY be used with the ONC RPC security flavors other than RPCSEC_GSSv3 MAY be used with the
server-side copy offload operations described in this document. In server-side copy offload operations described in this document. In
particular, host-based ONC RPC security flavors such as AUTH_NONE and particular, host-based ONC RPC security flavors such as AUTH_NONE and
AUTH_SYS MAY be used. If a host-based security flavor is used, a AUTH_SYS MAY be used. If a host-based security flavor is used, a
minimal level of protection for the server-to-server copy protocol is minimal level of protection for the server-to-server copy protocol is
possible. possible.
skipping to change at page 23, line 47 skipping to change at page 24, line 47
server identified in the COPY_NOTIFY. This random number technique server identified in the COPY_NOTIFY. This random number technique
only provides initial authentication of the destination server, and only provides initial authentication of the destination server, and
cannot defend against man-in-the-middle attacks after authentication cannot defend against man-in-the-middle attacks after authentication
or an eavesdropper that observes the random number on the wire. or an eavesdropper that observes the random number on the wire.
Other secure communication techniques (e.g., IPsec) are necessary to Other secure communication techniques (e.g., IPsec) are necessary to
block these attacks. block these attacks.
2.4.1.4. Inter-Server Copy without ONC RPC and RPCSEC_GSSv3 2.4.1.4. Inter-Server Copy without ONC RPC and RPCSEC_GSSv3
The same techniques as Section 2.4.1.3, using unique URLs for each The same techniques as Section 2.4.1.3, using unique URLs for each
destination server, can be used for other protocols (e.g., HTTP [14] destination server, can be used for other protocols (e.g., HTTP [13]
and FTP [15]) as well. and FTP [14]) as well.
3. Sparse Files 3. Sparse Files
3.1. Introduction 3.1. Introduction
A sparse file is a common way of representing a large file without A sparse file is a common way of representing a large file without
having to utilize all of the disk space for it. Consequently, a having to utilize all of the disk space for it. Consequently, a
sparse file uses less physical space than its size indicates. This sparse file uses less physical space than its size indicates. This
means the file contains 'holes', byte ranges within the file that means the file contains 'holes', byte ranges within the file that
contain no data. Most modern file systems support sparse files, contain no data. Most modern file systems support sparse files,
skipping to change at page 24, line 34 skipping to change at page 25, line 34
the zeroes to be transferred. the zeroes to be transferred.
A sparse file is typically created by initializing the file to be all A sparse file is typically created by initializing the file to be all
zeros - nothing is written to the data in the file, instead the hole zeros - nothing is written to the data in the file, instead the hole
is recorded in the metadata for the file. So an 8G disk image might is recorded in the metadata for the file. So an 8G disk image might
be represented initially by a couple hundred bits in the inode and be represented initially by a couple hundred bits in the inode and
nothing on the disk. If the VM then writes 100M to a file in the nothing on the disk. If the VM then writes 100M to a file in the
middle of the image, there would now be two holes represented in the middle of the image, there would now be two holes represented in the
metadata and 100M in the data. metadata and 100M in the data.
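
The hole bookkeeping described above can be sketched as follows. This is
only an illustration, not part of the protocol: the helper name holes(),
the extent representation, and the 4G write offset are assumptions chosen
to mirror the 8G image and 100M write in the text.

```python
def holes(file_size, data_extents):
    """Return the (offset, length) holes of a file, given the
    (offset, length) extents that actually hold written data."""
    result = []
    pos = 0
    for off, length in sorted(data_extents):
        if off > pos:
            # Gap between the previous data and this extent is a hole.
            result.append((pos, off - pos))
        pos = max(pos, off + length)
    if pos < file_size:
        # Trailing hole up to the end of the file.
        result.append((pos, file_size - pos))
    return result


G = 1 << 30
M = 1 << 20
image_size = 8 * G

# Freshly created image: one hole covering the whole file, no data.
empty = holes(image_size, [])

# After 100M is written in the middle (offset 4G is an assumed value),
# the metadata describes two holes around 100M of data.
after_write = holes(image_size, [(4 * G, 100 * M)])
```

With these assumed values the second call yields one hole before the
written range and one after it, matching the two holes in the text.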
This section introduces a new operation READ_PLUS (Section 12.10) This section introduces a new operation READ_PLUS (Section 13.10)
which supports all the features of READ but includes an extension to which supports all the features of READ but includes an extension to
support sparse pattern files. READ_PLUS is guaranteed to perform no support sparse pattern files. READ_PLUS is guaranteed to perform no
worse than READ, and can dramatically improve performance with sparse worse than READ, and can dramatically improve performance with sparse
files. READ_PLUS does not depend on pNFS protocol features, but can files. READ_PLUS does not depend on pNFS protocol features, but can
be used by pNFS to support sparse files. be used by pNFS to support sparse files.
3.2. Terminology 3.2. Terminology
Regular file: An object of file type NF4REG or NF4NAMEDATTR. Regular file: An object of file type NF4REG or NF4NAMEDATTR.
skipping to change at page 28, line 15 skipping to change at page 29, line 15
This section adds a new IO_ADVISE operation to communicate the client This section adds a new IO_ADVISE operation to communicate the client
file access patterns to the NFS server. The NFS server upon file access patterns to the NFS server. The NFS server upon
receiving an IO_ADVISE operation MAY choose to alter its I/O and receiving an IO_ADVISE operation MAY choose to alter its I/O and
caching behavior, but is under no obligation to do so. caching behavior, but is under no obligation to do so.
5.2. POSIX Requirements 5.2. POSIX Requirements
The first key requirement of the IO_ADVISE operation is to support The first key requirement of the IO_ADVISE operation is to support
the posix_fadvise function [6], which is supported in Linux and many the posix_fadvise function [6], which is supported in Linux and many
other operating systems. Examples and guidance on how to use other operating systems. Examples and guidance on how to use
posix_fadvise to improve performance can be found here [17]. posix_fadvise to improve performance can be found here [16].
posix_fadvise is defined as follows, posix_fadvise is defined as follows,
int posix_fadvise(int fd, off_t offset, off_t len, int advice); int posix_fadvise(int fd, off_t offset, off_t len, int advice);
The posix_fadvise() function shall advise the implementation on the The posix_fadvise() function shall advise the implementation on the
expected behavior of the application with respect to the data in the expected behavior of the application with respect to the data in the
file associated with the open file descriptor, fd, starting at offset file associated with the open file descriptor, fd, starting at offset
and continuing for len bytes. The specified range need not currently and continuing for len bytes. The specified range need not currently
exist in the file. If len is zero, all data following offset is exist in the file. If len is zero, all data following offset is
specified. The implementation may use this information to optimize specified. The implementation may use this information to optimize
skipping to change at page 30, line 23 skipping to change at page 31, line 23
5.5. IANA Considerations 5.5. IANA Considerations
The IO_ADVISE_type4 will be extended through an IANA registry. The IO_ADVISE_type4 will be extended through an IANA registry.
6. Application Data Block Support 6. Application Data Block Support
At the OS level, files are contained on disk blocks. Applications At the OS level, files are contained on disk blocks. Applications
are also free to impose structure on the data contained in a file and are also free to impose structure on the data contained in a file and
we can define an Application Data Block (ADB) to be such a structure. we can define an Application Data Block (ADB) to be such a structure.
From the application's viewpoint, it only wants to handle ADBs and From the application's viewpoint, it only wants to handle ADBs and
not raw bytes (see [18]). An ADB is typically comprised of two not raw bytes (see [17]). An ADB is typically comprised of two
sections: a header and data. The header describes the sections: a header and data. The header describes the
characteristics of the block and can provide a means to detect characteristics of the block and can provide a means to detect
corruption in the data payload. The data section is typically corruption in the data payload. The data section is typically
initialized to all zeros. initialized to all zeros.
The format of the header is application specific, but there are two The format of the header is application specific, but there are two
main components typically encountered: main components typically encountered:
1. An ADB Number (ADBN), which allows the application to determine 1. An ADB Number (ADBN), which allows the application to determine
which data block is being referenced. The ADBN is a logical which data block is being referenced. The ADBN is a logical
block number and is useful when the client is not storing the block number and is useful when the client is not storing the
blocks in contiguous memory. blocks in contiguous memory.
2. Fields to describe the state of the ADB and a means to detect 2. Fields to describe the state of the ADB and a means to detect
block corruption. For both pieces of data, a useful property is block corruption. For both pieces of data, a useful property is
that allowed values be unique in that if passed across the that allowed values be unique in that if passed across the
network, corruption due to translation between big and little network, corruption due to translation between big and little
endian architectures are detectable. For example, 0xF0DEDEF0 has endian architectures are detectable. For example, 0xF0DEDEF0 has
the same bit pattern in both architectures. the same bit pattern in both architectures.
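
The byte-order property in item 2 can be checked with a short sketch.
The helper name is_endian_symmetric() and the counterexample value
0x12345678 are assumptions made for illustration; only 0xF0DEDEF0 comes
from the text.

```python
import struct


def is_endian_symmetric(value):
    """True if a 32-bit value is laid out as the same byte sequence
    in big-endian and little-endian order."""
    return struct.pack(">I", value) == struct.pack("<I", value)


# The value from the text reads the same in both byte orders.
guard_ok = is_endian_symmetric(0xF0DEDEF0)

# An arbitrary value (assumed here for contrast) does not.
guard_bad = is_endian_symmetric(0x12345678)
```

A value with this property cannot be silently altered by a mistaken
byte swap, which is why it is convenient as a marker.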
Applications already impose structures on files [18] and detect Applications already impose structures on files [17] and detect
corruption in data blocks [19]. What they are not able to do is corruption in data blocks [18]. What they are not able to do is
efficiently transfer and store ADBs. To initialize a file with ADBs, efficiently transfer and store ADBs. To initialize a file with ADBs,
the client must send the full ADB to the server and that must be the client must send the full ADB to the server and that must be
stored on the server. When the application is initializing a file to stored on the server. When the application is initializing a file to
have the ADB structure, it could compress the ADBs to just the have the ADB structure, it could compress the ADBs to just the
information necessary to later reconstruct the header portion of information necessary to later reconstruct the header portion of
the ADB when the contents are read back. Using sparse file the ADB when the contents are read back. Using sparse file
techniques, the disk blocks described by would not be allocated. techniques, the disk blocks described by would not be allocated.
Unlike sparse file techniques, there would be a small cost to store Unlike sparse file techniques, there would be a small cost to store
the compressed header data. the compressed header data.
skipping to change at page 31, line 27 skipping to change at page 32, line 27
6.1. Generic Framework 6.1. Generic Framework
We want the representation of the ADB to be flexible enough to We want the representation of the ADB to be flexible enough to
support many different applications. The most basic approach is no support many different applications. The most basic approach is no
imposition of a block at all, which means we are working with the raw imposition of a block at all, which means we are working with the raw
bytes. Such an approach would be useful for storing holes, punching bytes. Such an approach would be useful for storing holes, punching
holes, etc. In more complex deployments, a server might be holes, etc. In more complex deployments, a server might be
supporting multiple applications, each with their own definition of supporting multiple applications, each with their own definition of
the ADB. One might store the ADBN at the start of the block and then the ADB. One might store the ADBN at the start of the block and then
have a guard pattern to detect corruption [20]. The next might store have a guard pattern to detect corruption [19]. The next might store
the ADBN at an offset of 100 bytes within the block and have no guard the ADBN at an offset of 100 bytes within the block and have no guard
pattern at all. The point is that existing applications might pattern at all. The point is that existing applications might
already have well defined formats for their data blocks. already have well defined formats for their data blocks.
The guard pattern can be used to represent the state of the block, to The guard pattern can be used to represent the state of the block, to
protect against corruption, or both. Again, it needs to be able to protect against corruption, or both. Again, it needs to be able to
be placed anywhere within the ADB. be placed anywhere within the ADB.
We need to be able to represent the starting offset of the block and We need to be able to represent the starting offset of the block and
the size of the block. Note that nothing prevents the application the size of the block. Note that nothing prevents the application
skipping to change at page 32, line 48 skipping to change at page 33, line 48
impose on the MDS to asynchronously read the data from the DS. impose on the MDS to asynchronously read the data from the DS.
Furthermore, each DS MUST NOT report to a client either a sparse ADB Furthermore, each DS MUST NOT report to a client either a sparse ADB
or data which belongs to another DS. One implication of this or data which belongs to another DS. One implication of this
requirement is that the app_data_block4's adb_block_size MUST requirement is that the app_data_block4's adb_block_size MUST
either be the stripe width or the stripe width must be an even either be the stripe width or the stripe width must be an even
multiple of it. multiple of it.
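
One way to read the block size constraint above is sketched below. It
assumes that "even multiple" means the stripe width is evenly divisible
by adb_block_size; that reading, the function name, and the sample sizes
are assumptions, not statements from the draft.

```python
def adb_block_size_fits_stripe(adb_block_size, stripe_width):
    """Check the assumed reading of the constraint: the stripe width
    is the block size itself or a whole multiple of it, so that no
    ADB straddles two data servers."""
    if adb_block_size <= 0 or stripe_width <= 0:
        return False
    return stripe_width % adb_block_size == 0


same_size = adb_block_size_fits_stripe(4096, 4096)       # equal sizes
whole_multiple = adb_block_size_fits_stripe(4096, 65536)  # 16 ADBs per stripe
straddles = adb_block_size_fits_stripe(4096, 6000)        # ADB would cross a DS
too_large = adb_block_size_fits_stripe(8192, 4096)        # ADB larger than stripe
```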
The second implication here is that the DS must be able to use the The second implication here is that the DS must be able to use the
Control Protocol to determine from the MDS where the sparse ADBs Control Protocol to determine from the MDS where the sparse ADBs
occur. [[Comment.4: Need to discuss what happens if after the file occur. [[Comment.3: Need to discuss what happens if after the file
is being written to and an INITIALIZE occurs? --TH]] Perhaps instead is being written to and an INITIALIZE occurs? --TH]] Perhaps instead
of the DS pulling from the MDS, the MDS pushes to the DS? Thus an of the DS pulling from the MDS, the MDS pushes to the DS? Thus an
INITIALIZE causes a new push? [[Comment.5: Still need to consider INITIALIZE causes a new push? [[Comment.4: Still need to consider
race cases of the DS getting a WRITE and the MDS getting an race cases of the DS getting a WRITE and the MDS getting an
INITIALIZE. --TH]] INITIALIZE. --TH]]
6.3. An Example of Detecting Corruption 6.3. An Example of Detecting Corruption
In this section, we define an ADB format in which corruption can be In this section, we define an ADB format in which corruption can be
detected. Note that this is just one possible format and means to detected. Note that this is just one possible format and means to
detect corruption. detect corruption.
Consider a very basic implementation of an operating system's disk Consider a very basic implementation of an operating system's disk
skipping to change at page 33, line 38 skipping to change at page 34, line 38
0xcafedead - This is the DATA state and indicates that real data 0xcafedead - This is the DATA state and indicates that real data
has been written to this block. has been written to this block.
0xe4e5c001 - This is the INDIRECT state and indicates that the 0xe4e5c001 - This is the INDIRECT state and indicates that the
block contains block counter numbers that are chained off of this block contains block counter numbers that are chained off of this
block. block.
0xba1ed4a3 - This is the INVALID state and indicates that the block 0xba1ed4a3 - This is the INVALID state and indicates that the block
contains data whose contents are garbage. contains data whose contents are garbage.
Finally, it also defines an 8 byte checksum [21] starting at byte 16 Finally, it also defines an 8 byte checksum [20] starting at byte 16
which applies to the remaining contents of the block. If the state which applies to the remaining contents of the block. If the state
is FREE, then that checksum is trivially zero. As such, the is FREE, then that checksum is trivially zero. As such, the
application has no need to transfer the checksum implicitly inside application has no need to transfer the checksum implicitly inside
the ADB - it need not make the transfer layer aware of the fact that the ADB - it need not make the transfer layer aware of the fact that
there is a checksum (see [19] for an example of checksums used to there is a checksum (see [18] for an example of checksums used to
detect corruption in application data blocks). detect corruption in application data blocks).
Corruption in each ADB can be detected thusly: Corruption in each ADB can be detected thusly:
o If the guard pattern is anything other than one of the allowed o If the guard pattern is anything other than one of the allowed
values, including all zeros. values, including all zeros.
o If the guard pattern is FREE and any other byte in the remainder o If the guard pattern is FREE and any other byte in the remainder
of the ADB is anything other than zero. of the ADB is anything other than zero.
skipping to change at page 35, line 26 skipping to change at page 36, line 26
7. Labeled NFS 7. Labeled NFS
7.1. Introduction 7.1. Introduction
Access control models such as Unix permissions or Access Control Access control models such as Unix permissions or Access Control
Lists are commonly referred to as Discretionary Access Control (DAC) Lists are commonly referred to as Discretionary Access Control (DAC)
models. These systems base their access decisions on user identity models. These systems base their access decisions on user identity
and resource ownership. In contrast Mandatory Access Control (MAC) and resource ownership. In contrast Mandatory Access Control (MAC)
models base their access control decisions on the label on the models base their access control decisions on the label on the
subject (usually a process) and the object it wishes to access. subject (usually a process) and the object it wishes to access [7].
These labels may contain user identity information but usually These labels may contain user identity information but usually
contain additional information. In DAC systems users are free to contain additional information. In DAC systems users are free to
specify the access rules for resources that they own. MAC models specify the access rules for resources that they own. MAC models
base their security decisions on a system wide policy established by base their security decisions on a system wide policy established by
an administrator or organization which the users do not have the an administrator or organization which the users do not have the
ability to override. In this section, we add a MAC model to NFSv4. ability to override. In this section, we add a MAC model to NFSv4.
The first change necessary is to devise a method for transporting and The first change necessary is to devise a method for transporting and
storing security label data on NFSv4 file objects. Security labels storing security label data on NFSv4 file objects. Security labels
have several semantics that are met by NFSv4 recommended attributes have several semantics that are met by NFSv4 recommended attributes
skipping to change at page 36, line 8 skipping to change at page 37, line 8
The second change is to provide a method for the server to notify the The second change is to provide a method for the server to notify the
client that the attribute changed on an open file on the server. If client that the attribute changed on an open file on the server. If
the file is closed, then during the open attempt, the client will the file is closed, then during the open attempt, the client will
gather the new attribute value. The server MUST NOT communicate the gather the new attribute value. The server MUST NOT communicate the
new value of the attribute, the client MUST query it. This new value of the attribute, the client MUST query it. This
requirement stems from the need for the client to provide sufficient requirement stems from the need for the client to provide sufficient
access rights to the attribute. access rights to the attribute.
The final change necessary is a modification to the RPC layer used in The final change necessary is a modification to the RPC layer used in
NFSv4 in the form of a new version of the RPCSEC_GSS [7] framework. NFSv4 in the form of a new version of the RPCSEC_GSS [8] framework.
In order for an NFSv4 server to apply MAC checks it must obtain In order for an NFSv4 server to apply MAC checks it must obtain
additional information from the client. Several methods were additional information from the client. Several methods were
explored for performing this and it was decided that the best explored for performing this and it was decided that the best
approach was to incorporate the ability to make security attribute approach was to incorporate the ability to make security attribute
assertions through the RPC mechanism. RPCSECGSSv3 [5] outlines a assertions through the RPC mechanism. RPCSECGSSv3 [5] outlines a
method to assert additional security information such as security method to assert additional security information such as security
labels on gss context creation and have that data bound to all RPC labels on gss context creation and have that data bound to all RPC
requests that make use of that context. requests that make use of that context.
7.2. Definitions 7.2. Definitions
skipping to change at page 36, line 34 skipping to change at page 37, line 34
semantics of the label. semantics of the label.
Label Format Registry: is the IANA registry containing all Label Format Registry: is the IANA registry containing all
registered LFS along with references to the documents that registered LFS along with references to the documents that
describe the syntactic format and semantics of the security label. describe the syntactic format and semantics of the security label.
Policy Identifier (PI): is an optional part of the definition of a Policy Identifier (PI): is an optional part of the definition of a
Label Format Specifier which allows for clients and server to Label Format Specifier which allows for clients and server to
identify specific security policies. identify specific security policies.
Domain of Interpretation (DOI): represents an administrative
security boundary, where all systems within the DOI have
semantically coherent labeling. That is, a security attribute
must always mean exactly the same thing anywhere within the DOI.
Object: is a passive resource within the system that we wish to be Object: is a passive resource within the system that we wish to be
protected. Objects can be entities such as files, directories, protected. Objects can be entities such as files, directories,
pipes, sockets, and many other system resources relevant to the pipes, sockets, and many other system resources relevant to the
protection of the system state. protection of the system state.
Subject: A subject is an active entity usually a process which is Subject: A subject is an active entity usually a process which is
requesting access to an object. requesting access to an object.
Multi-Level Security (MLS): is a traditional model where objects are Multi-Level Security (MLS): is a traditional model where objects are
given a sensitivity level (Unclassified, Secret, Top Secret, etc) given a sensitivity level (Unclassified, Secret, Top Secret, etc)
and a category set [22]. and a category set [21].
7.3. MAC Security Attribute 7.3. MAC Security Attribute
MAC models base access decisions on security attributes bound to MAC models base access decisions on security attributes bound to
subjects and objects. This information can range from a user subjects and objects. This information can range from a user
identity for an identity based MAC model, sensitivity levels for identity for an identity based MAC model, sensitivity levels for
Multi-level security, or a type for Type Enforcement. These models Multi-level security, or a type for Type Enforcement. These models
base their decisions on different criteria but the semantics of the base their decisions on different criteria but the semantics of the
security attribute remain the same. The semantics required by the security attribute remain the same. The semantics required by the
security attributes are listed below: security attributes are listed below:
o Must provide flexibility with respect to MAC model. o Must provide flexibility with respect to MAC model.
o Must provide the ability to atomically set security information o Must provide the ability to atomically set security information
upon object creation upon object creation.
o Must provide the ability to enforce access control decisions both o Must provide the ability to enforce access control decisions both
on the client and the server on the client and the server.
o Must not expose an object to either the client or server name o Must not expose an object to either the client or server name
space before its security information has been bound to it. space before its security information has been bound to it.
NFSv4 implements the security attribute as a recommended attribute. NFSv4 implements the security attribute as a recommended attribute.
These attributes have a fixed format and semantics, which conflicts These attributes have a fixed format and semantics, which conflicts
with the flexible nature of the security attribute. To resolve this with the flexible nature of the security attribute. To resolve this
the security attribute consists of two components. The first the security attribute consists of two components. The first
component is a LFS as defined in [23] to allow for interoperability component is a LFS as defined in [22] to allow for interoperability
between MAC mechanisms. The second component is an opaque field between MAC mechanisms. The second component is an opaque field
which is the actual security attribute data. To allow for various which is the actual security attribute data. To allow for various
MAC models NFSv4 should be used solely as a transport mechanism for MAC models NFSv4 should be used solely as a transport mechanism for
the security attribute. It is the responsibility of the endpoints to the security attribute. It is the responsibility of the endpoints to
consume the security attribute and make access decisions based on consume the security attribute and make access decisions based on
their respective models. In addition, creation of objects through their respective models. In addition, creation of objects through
OPEN and CREATE allows for the security attribute to be specified OPEN and CREATE allows for the security attribute to be specified
upon creation. By providing an atomic create and set operation for upon creation. By providing an atomic create and set operation for
the security attribute it is possible to enforce the second and the security attribute it is possible to enforce the second and
fourth requirements. The recommended attribute FATTR4_SEC_LABEL will fourth requirements. The recommended attribute FATTR4_SEC_LABEL will
be used to satisfy this requirement. be used to satisfy this requirement.
7.3.1. Interpreting FATTR4_SEC_LABEL 7.3.1. Interpreting FATTR4_SEC_LABEL
The XDR [24] necessary to implement Labeled NFSv4 is presented below: The XDR [23] necessary to implement Labeled NFSv4 is presented below:
const FATTR4_SEC_LABEL = 81; const FATTR4_SEC_LABEL = 81;
typedef uint32_t policy4; typedef uint32_t policy4;
Figure 6 Figure 6
struct labelformat_spec4 { struct labelformat_spec4 {
policy4 lfs_lfs; policy4 lfs_lfs;
policy4 lfs_pi; policy4 lfs_pi;
skipping to change at page 38, line 21 skipping to change at page 39, line 21
labelformat_spec4 slai_lfs; labelformat_spec4 slai_lfs;
opaque slai_data<>; opaque slai_data<>;
}; };
The FATTR4_SEC_LABEL contains an array of two components with the The FATTR4_SEC_LABEL contains an array of two components with the
first component being an LFS. It serves to provide the receiving end first component being an LFS. It serves to provide the receiving end
with the information necessary to translate the security attribute with the information necessary to translate the security attribute
into a form that is usable by the endpoint. Label Formats assigned into a form that is usable by the endpoint. Label Formats assigned
an LFS may optionally choose to include a Policy Identifier field to an LFS may optionally choose to include a Policy Identifier field to
allow for complex policy deployments. The LFS and Label Format allow for complex policy deployments. The LFS and Label Format
Registry are described in detail in [23]. The translation used to Registry are described in detail in [22]. The translation used to
interpret the security attribute is not specified as part of the interpret the security attribute is not specified as part of the
protocol as it may depend on various factors. The second component protocol as it may depend on various factors. The second component
is an opaque section which contains the data of the attribute. This is an opaque section which contains the data of the attribute. This
component is dependent on the MAC model to interpret and enforce. component is dependent on the MAC model to interpret and enforce.
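
The two-component layout can be illustrated with a minimal encoding
sketch. It assumes labelformat_spec4 holds only the two policy4 fields
shown above and applies the standard XDR rules for unsigned integers and
variable-length opaque data. The function names and the sample LFS, PI,
and label values are hypothetical.

```python
import struct


def xdr_opaque(data):
    """Encode variable-length opaque data: a 4-byte big-endian length,
    the bytes, then zero padding up to a 4-byte boundary."""
    pad = (4 - len(data) % 4) % 4
    return struct.pack(">I", len(data)) + data + b"\x00" * pad


def encode_sec_label(lfs, pi, label):
    """Encode the LFS component (lfs_lfs, lfs_pi) followed by the
    opaque label data (slai_data)."""
    return struct.pack(">II", lfs, pi) + xdr_opaque(label)


# Hypothetical values: LFS 1, PI 0, and a 5-byte label.
encoded = encode_sec_label(1, 0, b"label")
```

The receiver reads the LFS first and only then decides how to interpret
the opaque bytes, which keeps NFSv4 a pure transport for the label.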
In particular, it is the responsibility of the LFS specification to In particular, it is the responsibility of the LFS specification to
define a maximum size for the opaque section, slai_data<>. When define a maximum size for the opaque section, slai_data<>. When
creating or modifying a label for an object, the client needs to be creating or modifying a label for an object, the client needs to be
guaranteed that the server will accept a label that is sized guaranteed that the server will accept a label that is sized
correctly. By both client and server being part of a specific MAC correctly. By both client and server being part of a specific MAC
skipping to change at page 40, line 7 skipping to change at page 41, line 7
the client may share the same object between multiple subjects or a the client may share the same object between multiple subjects or a
security system which is not strictly hierarchical that the security system which is not strictly hierarchical that the
CB_ATTR_CHANGED callback is very useful. It allows the server to CB_ATTR_CHANGED callback is very useful. It allows the server to
inform the clients that the cached security attribute is now stale. inform the clients that the cached security attribute is now stale.
Consider a system in which the clients enforce MAC checks and the Consider a system in which the clients enforce MAC checks and the
server has a very simple security system which just stores the server has a very simple security system which just stores the
labels. In this system, the MAC label check always allows access, labels. In this system, the MAC label check always allows access,
regardless of the subject label. regardless of the subject label.
The way in which MAC labels are enforced is by the smart client. So The way in which MAC labels are enforced is by the client. So if
if client A changes a security label on a file, then the server MUST client A changes a security label on a file, then the server MUST
inform all clients that have the file opened that the label has inform all clients that have the file opened that the label has
changed via CB_ATTR_CHANGED. Then the clients MUST retrieve the new changed via CB_ATTR_CHANGED. Then the clients MUST retrieve the new
label and MUST enforce access via the new attribute values. label and MUST enforce access via the new attribute values.
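
The notification flow above can be sketched as a small in-memory
simulation. The Server and Client classes and their method names are
hypothetical stand-ins, not protocol definitions. The callback carries
no label value, so each client has to query the server for it.

```python
class Server:
    """Stores labels only; it performs no MAC checks of its own."""

    def __init__(self):
        self.labels = {}
        self.openers = {}

    def open(self, client, path):
        # Remember which clients hold the file open.
        self.openers.setdefault(path, set()).add(client)
        return self.labels.get(path)

    def get_label(self, path):
        return self.labels.get(path)

    def set_label(self, path, label):
        self.labels[path] = label
        # Notify every client that has the file open; no value is sent.
        for client in self.openers.get(path, ()):
            client.cb_attr_changed(path)


class Client:
    def __init__(self, server):
        self.server = server
        self.cache = {}
        self.stale = set()

    def open(self, path):
        self.cache[path] = self.server.open(self, path)

    def cb_attr_changed(self, path):
        # The cached label can no longer be trusted.
        self.stale.add(path)

    def label(self, path):
        # Re-query the server before enforcing access on a stale label.
        if path in self.stale:
            self.cache[path] = self.server.get_label(path)
            self.stale.discard(path)
        return self.cache[path]


server = Server()
server.labels["/f"] = b"old"
a, b = Client(server), Client(server)
a.open("/f")
b.open("/f")

server.set_label("/f", b"new")       # client A changes the label
stale_on_b = "/f" in b.stale         # B was told its cache is stale
label_seen_by_b = b.label("/f")      # B fetches the new label itself
```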
[[Comment.6: Describe a LFS of 0, which will be the means to indicate
such a deployment. In the current LFR, 0 is marked as reserved. If
we use it, then we define the default LFS to be used by a LNFS aware
server. I.e., it lets smart clients work together in the face of a
dumb server. Note that will supporting this system is optional, it
will make for a very good debugging mode during development. I.e.,
even if a server does not deploy with another security system, this
mode gets your foot in the door. --TH]]
7.4. pNFS Considerations 7.4. pNFS Considerations
This section examines the issues in deploying LNFS in a pNFS This section examines the issues in deploying LNFS in a pNFS
community of servers. community of servers.
7.4.1. MAC Label Checks 7.4.1. MAC Label Checks
The new FATTR4_SEC_LABEL attribute is metadata information and as The new FATTR4_SEC_LABEL attribute is metadata information and as
such the DS is not aware of the value contained on the MDS. such the DS is not aware of the value contained on the MDS.
Fortunately, the NFSv4.1 protocol [2] already has provisions for Fortunately, the NFSv4.1 protocol [2] already has provisions for
skipping to change at page 41, line 17 skipping to change at page 42, line 8
Note that the server might have imposed a security flavor on the root Note that the server might have imposed a security flavor on the root
that precludes such access. I.e., if the server requires kerberized that precludes such access. I.e., if the server requires kerberized
access and the client presents a compound with AUTH_SYS, then the access and the client presents a compound with AUTH_SYS, then the
server is allowed to return NFS4ERR_WRONGSEC in this case. But if server is allowed to return NFS4ERR_WRONGSEC in this case. But if
the client presents a correct security flavor, then the server MUST the client presents a correct security flavor, then the server MUST
return the FATTR4_SEC_LABEL attribute with the supported LFS filled return the FATTR4_SEC_LABEL attribute with the supported LFS filled
in. in.
7.6. MAC Security NFS Modes of Operation 7.6. MAC Security NFS Modes of Operation
A system using Labeled NFS may operate in three modes. The first A system using Labeled NFS may operate in two modes. The first mode
mode provides the most protection and is called "full mode". In this provides the most protection and is called "full mode". In this mode
mode both the client and server implement a MAC model allowing each both the client and server implement a MAC model allowing each end to
end to make an access control decision. The remaining two modes are make an access control decision. The remaining mode is called the
variations on each other and are called "smart client" and "smart "guest mode" and in this mode one end of the connection is not
server" modes. In these modes one end of the connection is not implementing a MAC model and thus offers less protection than full
implementing a MAC model and because of this these operating modes mode.
offer less protection than full mode.
7.6.1. Full Mode 7.6.1. Full Mode
Full mode environments consist of MAC aware NFSv4 servers and clients Full mode environments consist of MAC aware NFSv4 servers and clients
and may be composed of mixed MAC models and policies. The system and may be composed of mixed MAC models and policies. The system
requires that both the client and server have an opportunity to requires that both the client and server have an opportunity to
perform an access control check based on all relevant information perform an access control check based on all relevant information
within the network. The file object security attribute is provided within the network. The file object security attribute is provided
using the mechanism described in Section 7.3. The security attribute using the mechanism described in Section 7.3. The security attribute
of the subject making the request is transported at the RPC layer of the subject making the request is transported at the RPC layer
skipping to change at page 41, line 52 skipping to change at page 42, line 42
client to make a decision as to the acceptable security attributes to client to make a decision as to the acceptable security attributes to
create a file with before sending the request to the server. Once create a file with before sending the request to the server. Once
the server receives the creation request from the client it may the server receives the creation request from the client it may
choose to evaluate if the security attribute is acceptable. choose to evaluate if the security attribute is acceptable.
Security attributes on the client and server may vary based on MAC Security attributes on the client and server may vary based on MAC
model and policy. To handle this the security attribute field has an model and policy. To handle this the security attribute field has an
LFS component. This component is a mechanism for the host to LFS component. This component is a mechanism for the host to
identify the format and meaning of the opaque portion of the security identify the format and meaning of the opaque portion of the security
attribute. A full mode environment may contain hosts operating in attribute. A full mode environment may contain hosts operating in
several different LFSs and DOIs. In this case a mechanism for several different LFSs. In this case a mechanism for translating the
translating the opaque portion of the security attribute is needed. opaque portion of the security attribute is needed. The actual
The actual translation function will vary based on MAC model and translation function will vary based on MAC model and policy and is
policy and is out of the scope of this document. If a translation is out of the scope of this document. If a translation is unavailable
unavailable for a given LFS and DOI then the request SHOULD be for a given LFS then the request SHOULD be denied. Another recourse
denied. Another recourse is to allow the host to provide a fallback is to allow the host to provide a fallback mapping for unknown
mapping for unknown security attributes. security attributes.
7.6.1.2. Policy Enforcement 7.6.1.2. Policy Enforcement
In full mode access control decisions are made by both the clients In full mode access control decisions are made by both the clients
and servers. When a client makes a request it takes the security and servers. When a client makes a request it takes the security
attribute from the requesting process and makes an access control attribute from the requesting process and makes an access control
decision based on that attribute and the security attribute of the decision based on that attribute and the security attribute of the
object it is trying to access. If the client denies that access an object it is trying to access. If the client denies that access an
RPC call to the server is never made. If however the access is RPC call to the server is never made. If however the access is
allowed the client will make a call to the NFS server. allowed the client will make a call to the NFS server.
skipping to change at page 42, line 34 skipping to change at page 43, line 28
trying to access to make an access control decision. If the server's trying to access to make an access control decision. If the server's
policy allows this access it will fulfill the client's request, policy allows this access it will fulfill the client's request,
otherwise it will return NFS4ERR_ACCESS. otherwise it will return NFS4ERR_ACCESS.
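As a rough illustration of the two-stage check described above, the following Python sketch models the client and server decisions. The policy callables, label strings, and function names are hypothetical and are not part of the protocol. Only NFS4_OK and NFS4ERR_ACCESS come from NFSv4, and a real MAC model would be considerably richer than a boolean function.

```python
from typing import Callable, Optional

NFS4_OK = 0
NFS4ERR_ACCESS = 13  # status value from the base NFSv4 protocol

# Hypothetical policy type: (subject_label, object_label) -> access allowed?
Policy = Callable[[str, str], bool]


def server_handle(server_policy: Policy, subject: str, obj: str) -> int:
    """Server-side check: fulfill the request or return NFS4ERR_ACCESS."""
    return NFS4_OK if server_policy(subject, obj) else NFS4ERR_ACCESS


def client_request(client_policy: Policy, server_policy: Policy,
                   subject: str, obj: str) -> Optional[int]:
    """Client-side check first; None means no RPC call was ever made."""
    if not client_policy(subject, obj):
        return None  # denied locally, the server is never contacted
    return server_handle(server_policy, subject, obj)
```

In this sketch a local denial is reported as None so that it can be told apart from a status returned by the server.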
Implementations MAY validate security attributes supplied over the Implementations MAY validate security attributes supplied over the
network to ensure that they are within a set of attributes permitted network to ensure that they are within a set of attributes permitted
from a specific peer, and if not, reject them. Note that a system from a specific peer, and if not, reject them. Note that a system
may permit a different set of attributes to be accepted from each may permit a different set of attributes to be accepted from each
peer. peer.
7.6.2. Smart Client Mode 7.6.1.3. Label Aware Only Server
Smart client environments consist of NFSv4 servers that are not MAC
aware but NFSv4 clients that are. Clients in this environment
may consist of groups implementing different MAC models and policies.
The system requires that all clients in the environment be
responsible for access control checks. Due to the amount of trust
placed in the clients this mode is only to be used in a trusted
environment.
7.6.2.1. Initial Labeling and Translation
Just like in full mode the client is responsible for determining the
initial label upon object creation. The server in smart client mode
does not implement a MAC model, however, it may provide the ability
to restrict the creation and labeling of objects with certain labels
based on different criteria as described in Section 7.6.1.2.
In a smart client environment a group of clients operate in a single
DOI. This removes the need for the clients to maintain a set of DOI
translations. Servers should provide a method to allow different
groups of clients to access the server at the same time. However, it
should not let two groups of clients operating in different DOIs
access the same files.
7.6.2.2. Policy Enforcement
In smart client mode access control decisions are made by the If the LFS is 0, then it indicates a server which is label aware, but
clients. When a client accesses an object it obtains the security does not enforce policies. Such a server will store and retrieve all
attribute of the object from the server and combines it with the object labels presented by clients, notify the clients of any label
security attribute of the process making the request to make an changes via CB_ATTR_CHANGED, but will not restrict access via the
access control decision. This check is in addition to the DAC checks subject label. Instead, it will expect the clients to enforce all
provided by NFSv4 so this may fail based on the DAC criteria even if such access locally.
the MAC policy grants access. As the policy check is located on the
client an access control denial should take the form that is native
to the platform.
7.6.3. Smart Server Mode 7.6.2. Guest Mode
Smart server environments consist of NFSv4 servers that are MAC aware Guest mode implies that either the client or the server does not
and one or more MAC unaware clients. The server is the only entity handle labels. If the client is not LNFS aware, then it will not
offer subject labels to the server. The server is the only entity
enforcing policy, and may selectively provide standard NFS services enforcing policy, and may selectively provide standard NFS services
to clients based on their authentication credentials and/or to clients based on their authentication credentials and/or
associated network attributes (e.g., IP address, network interface). associated network attributes (e.g., IP address, network interface).
The level of trust and access extended to a client in this mode is The level of trust and access extended to a client in this mode is
configuration-specific. configuration-specific. If the server is not LNFS aware, then it
will not return object labels to the client. Clients in this
7.6.3.1. Initial Labeling and Translation environment may consist of groups implementing different MAC
model policies. The system requires that all clients in the
In smart server mode all labeling and access control decisions are environment be responsible for access control checks.
performed by the NFSv4 server. In this environment the NFSv4 clients
are not MAC aware so they cannot provide input into the access
control decision. This requires the server to determine the initial
labeling of objects. Normally the subject to use in this calculation
would originate from the client. Instead the NFSv4 server may choose
to assign the subject security attribute based on their
authentication credentials and/or associated network attributes
(e.g., IP address, network interface).
In smart server mode security attributes are contained solely within
the NFSv4 server. This means that all security attributes used in
the system remain within a single LFS and DOI. Since security
attributes will not cross DOIs or change format there is no need to
provide any translation functionality above that which is needed
internally by the MAC model.
7.6.3.2. Policy Enforcement
All access control decisions in smart server mode are made by the
server. The server will assign the subject a security attribute
based on some criteria (e.g., IP address, network interface). Using
the newly calculated security attribute and the security attribute of
the object being requested the MAC model makes the access control
check and returns NFS4ERR_ACCESS on a denial and NFS4_OK on success.
This check is done transparently to the client so if the MAC
permission check fails the client may be unaware of the reason for
the permission failure. When operating in this mode administrators
attempting to debug permission failures should be aware to check the
MAC policy running on the server in addition to the DAC settings.
7.7. Security Considerations 7.7. Security Considerations
This entire document deals with security issues. This entire document deals with security issues.
Depending on the level of protection the MAC system offers there may Depending on the level of protection the MAC system offers there may
be a requirement to tightly bind the security attribute to the data. be a requirement to tightly bind the security attribute to the data.
When only one of the client or server enforces labels, it is When only one of the client or server enforces labels, it is
important to realize that the other side is not enforcing MAC important to realize that the other side is not enforcing MAC
skipping to change at page 44, line 42 skipping to change at page 44, line 28
An example of this is that a server that modifies READDIR or LOOKUP An example of this is that a server that modifies READDIR or LOOKUP
results based on the client's subject label might want to always results based on the client's subject label might want to always
construct the same subject label for a client which does not present construct the same subject label for a client which does not present
one. This will prevent a non-LNFS client from mixing entries in the one. This will prevent a non-LNFS client from mixing entries in the
directory cache. directory cache.
8. Sharing change attribute implementation details with NFSv4 clients 8. Sharing change attribute implementation details with NFSv4 clients
8.1. Introduction 8.1. Introduction
Although both the NFSv4 [11] and NFSv4.1 protocol [2], define the Although both the NFSv4 [10] and NFSv4.1 protocol [2], define the
change attribute as being mandatory to implement, there is little in change attribute as being mandatory to implement, there is little in
the way of guidance. The only feature that is mandated by them is the way of guidance. The only feature that is mandated by them is
that the value must change whenever the file data or metadata change. that the value must change whenever the file data or metadata change.
While this allows for a wide range of implementations, it also leaves While this allows for a wide range of implementations, it also leaves
the client with a conundrum: how does it determine which is the most the client with a conundrum: how does it determine which is the most
recent value for the change attribute in a case where several RPC recent value for the change attribute in a case where several RPC
calls have been issued in parallel? In other words if two COMPOUNDs, calls have been issued in parallel? In other words if two COMPOUNDs,
both containing WRITE and GETATTR requests for the same file, have both containing WRITE and GETATTR requests for the same file, have
been issued in parallel, how does the client determine which of the been issued in parallel, how does the client determine which of the
skipping to change at page 46, line 7 skipping to change at page 45, line 44
preserved when writing to pNFS data servers. preserved when writing to pNFS data servers.
NFS4_CHANGE_TYPE_IS_VERSION_COUNTER_NOPNFS: The change attribute NFS4_CHANGE_TYPE_IS_VERSION_COUNTER_NOPNFS: The change attribute
value MUST be incremented by one unit for every atomic change to value MUST be incremented by one unit for every atomic change to
the file attributes, data or directory contents. In the case the file attributes, data or directory contents. In the case
where the client is writing to pNFS data servers, the number of where the client is writing to pNFS data servers, the number of
increments is not guaranteed to exactly match the number of increments is not guaranteed to exactly match the number of
writes. writes.
NFS4_CHANGE_TYPE_IS_TIME_METADATA: The change attribute is NFS4_CHANGE_TYPE_IS_TIME_METADATA: The change attribute is
implemented as suggested in the NFSv4 spec [11] in terms of the implemented as suggested in the NFSv4 spec [10] in terms of the
time_metadata attribute. time_metadata attribute.
NFS4_CHANGE_TYPE_IS_UNDEFINED: The change attribute does not take NFS4_CHANGE_TYPE_IS_UNDEFINED: The change attribute does not take
values that fit into any of these categories. values that fit into any of these categories.
If either NFS4_CHANGE_TYPE_IS_MONOTONIC_INCR, If either NFS4_CHANGE_TYPE_IS_MONOTONIC_INCR,
NFS4_CHANGE_TYPE_IS_VERSION_COUNTER, or NFS4_CHANGE_TYPE_IS_VERSION_COUNTER, or
NFS4_CHANGE_TYPE_IS_TIME_METADATA are set, then the client knows at NFS4_CHANGE_TYPE_IS_TIME_METADATA are set, then the client knows at
the very least that the change attribute is monotonically increasing, the very least that the change attribute is monotonically increasing,
which is sufficient to resolve the question of which value is the which is sufficient to resolve the question of which value is the
skipping to change at page 46, line 34 skipping to change at page 46, line 22
Finally, if the client sees NFS4_CHANGE_TYPE_IS_VERSION_COUNTER, it Finally, if the client sees NFS4_CHANGE_TYPE_IS_VERSION_COUNTER, it
has the ability to predict what the resulting change attribute value has the ability to predict what the resulting change attribute value
should be after a COMPOUND containing a SETATTR, WRITE, or CREATE. should be after a COMPOUND containing a SETATTR, WRITE, or CREATE.
This again allows it to detect changes made in parallel by another This again allows it to detect changes made in parallel by another
client. The value NFS4_CHANGE_TYPE_IS_VERSION_COUNTER_NOPNFS permits client. The value NFS4_CHANGE_TYPE_IS_VERSION_COUNTER_NOPNFS permits
the same, but only if the client is not doing pNFS WRITEs. the same, but only if the client is not doing pNFS WRITEs.
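A client-side use of the change attribute type might look like the following Python sketch. The enum member values are placeholders rather than the on-the-wire encoding, and the helper names are hypothetical. The sketch treats only the three types named in the text as monotonic, ignores counter wraparound, and returns None whenever the text gives the client no guarantee.

```python
from enum import Enum, auto
from typing import Optional


class ChangeType(Enum):
    # Member values are placeholders, not the protocol's encoding.
    MONOTONIC_INCR = auto()
    VERSION_COUNTER = auto()
    VERSION_COUNTER_NOPNFS = auto()
    TIME_METADATA = auto()
    UNDEFINED = auto()


# Types for which the text says the value is monotonically increasing.
MONOTONIC = {
    ChangeType.MONOTONIC_INCR,
    ChangeType.VERSION_COUNTER,
    ChangeType.TIME_METADATA,
}


def most_recent(ctype: ChangeType, a: int, b: int) -> Optional[int]:
    """Pick the newer of two change values, or None if order is unknown."""
    return max(a, b) if ctype in MONOTONIC else None


def predicted_change(ctype: ChangeType, before: int, n_changes: int,
                     uses_pnfs_writes: bool = False) -> Optional[int]:
    """Predict the change value after n_changes atomic changes, if possible."""
    if ctype is ChangeType.VERSION_COUNTER:
        return before + n_changes
    if ctype is ChangeType.VERSION_COUNTER_NOPNFS and not uses_pnfs_writes:
        return before + n_changes
    return None
```

If the value returned by the server differs from the prediction, the client may conclude that another client changed the file in parallel.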
9. Security Considerations 9. Security Considerations
10. File Attributes 10. Error Values
10.1. Attribute Definitions NFS error numbers are assigned to failed operations within a Compound
(COMPOUND or CB_COMPOUND) request. A Compound request contains a
number of NFS operations that have their results encoded in sequence
in a Compound reply. The results of successful operations will
consist of an NFS4_OK status followed by the encoded results of the
operation. If an NFS operation fails, an error status will be
entered in the reply and the Compound request will be terminated.
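The processing rule above can be sketched in Python as follows. The representation of an operation as a callable returning a (status, payload) pair is an assumption made for illustration and says nothing about how a real server dispatches operations.

```python
from typing import Callable, List, Tuple

NFS4_OK = 0

# Hypothetical operation type: returns (status, encoded result or None).
Op = Callable[[], Tuple[int, object]]


def run_compound(ops: List[Op]) -> List[Tuple[int, object]]:
    """Evaluate operations in order, stopping after the first failure."""
    results = []
    for op in ops:
        status, payload = op()
        results.append((status, payload))  # results are encoded in sequence
        if status != NFS4_OK:
            break  # the Compound request is terminated at the failed op
    return results
```

The reply therefore holds one entry per operation up to and including the first one that failed.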
10.1.1. Attribute 77: space_reserved 10.1. Error Definitions
Protocol Error Definitions
+--------------------------+--------+------------------+
| Error | Number | Description |
+--------------------------+--------+------------------+
| NFS4ERR_BADLABEL | 10093 | Section 10.1.3.1 |
| NFS4ERR_MAC_ACCESS | 10094 | Section 10.1.3.2 |
| NFS4ERR_METADATA_NOTSUPP | 10090 | Section 10.1.2.1 |
| NFS4ERR_OFFLOAD_DENIED | 10091 | Section 10.1.2.2 |
| NFS4ERR_PARTNER_NO_AUTH | 10089 | Section 10.1.2.3 |
| NFS4ERR_PARTNER_NOTSUPP | 10088 | Section 10.1.2.4 |
| NFS4ERR_UNION_NOTSUPP | 10095 | Section 10.1.1.1 |
| NFS4ERR_WRONG_LFS | 10092 | Section 10.1.3.3 |
+--------------------------+--------+------------------+
Table 1
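An implementation might carry the assignments in Table 1 as an enumeration, as in this Python sketch. The class name and the helper function are hypothetical. The numbers are copied from the table.

```python
from enum import IntEnum


class Nfs42Error(IntEnum):
    """Error numbers as listed in Table 1, ordered by value."""
    NFS4ERR_PARTNER_NOTSUPP = 10088
    NFS4ERR_PARTNER_NO_AUTH = 10089
    NFS4ERR_METADATA_NOTSUPP = 10090
    NFS4ERR_OFFLOAD_DENIED = 10091
    NFS4ERR_WRONG_LFS = 10092
    NFS4ERR_BADLABEL = 10093
    NFS4ERR_MAC_ACCESS = 10094
    NFS4ERR_UNION_NOTSUPP = 10095


def error_name(status: int) -> str:
    """Map a status number to its symbolic name, if it is in Table 1."""
    try:
        return Nfs42Error(status).name
    except ValueError:
        return "unknown status %d" % status
```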
10.1.1. General Errors
This section deals with errors that are applicable to a broad set of
different purposes.
10.1.1.1. NFS4ERR_UNION_NOTSUPP (Error Code 10095)
One of the arguments to the operation is a discriminated union and
while the server supports the given operation, it does not support
the selected arm of the discriminated union. For an example, see
READ_PLUS (Section 13.10).
10.1.2. Server to Server Copy Errors
These errors deal with the interaction between server to server
copies.
10.1.2.1. NFS4ERR_METADATA_NOTSUPP (Error Code 10090)
The destination file cannot support the same metadata as the source
file.
10.1.2.2. NFS4ERR_OFFLOAD_DENIED (Error Code 10091)
The copy offload operation is supported by both the source and the
destination, but the destination is not allowing it for this file.
If the client sees this error, it should fall back to the normal copy
semantics.
10.1.2.3. NFS4ERR_PARTNER_NO_AUTH (Error Code 10089)
The remote server does not authorize a server-to-server copy offload
operation. This may be due to the client's failure to send the
COPY_NOTIFY operation to the remote server, the remote server
receiving a server-to-server copy offload request after the copy
lease time expired, or for some other permission problem.
10.1.2.4. NFS4ERR_PARTNER_NOTSUPP (Error Code 10088)
The remote server does not support the server-to-server copy offload
protocol.
10.1.3. Labeled NFS Errors
These errors are used in LNFS.
10.1.3.1. NFS4ERR_BADLABEL (Error Code 10093)
10.1.3.2. NFS4ERR_MAC_ACCESS (Error Code 10094)
10.1.3.3. NFS4ERR_WRONG_LFS (Error Code 10092)
11. File Attributes
11.1. Attribute Definitions
11.1.1. Attribute 77: space_reserved
The space_reserved attribute is a read/write attribute of type The space_reserved attribute is a read/write attribute of type
boolean. It is a per file attribute. When the space_reserved boolean. It is a per file attribute. When the space_reserved
attribute is set via SETATTR, the server must ensure that there is attribute is set via SETATTR, the server must ensure that there is
disk space to accommodate every byte in the file before it can return disk space to accommodate every byte in the file before it can return
success. If the server cannot guarantee this, it must return success. If the server cannot guarantee this, it must return
NFS4ERR_NOSPC. NFS4ERR_NOSPC.
If the client tries to grow a file which has the space_reserved If the client tries to grow a file which has the space_reserved
attribute set, the server must guarantee that there is disk space to attribute set, the server must guarantee that there is disk space to
skipping to change at page 47, line 21 skipping to change at page 48, line 41
The value of space_reserved can be obtained at any time through The value of space_reserved can be obtained at any time through
GETATTR. GETATTR.
In order to avoid ambiguity, the space_reserved bit cannot be set In order to avoid ambiguity, the space_reserved bit cannot be set
along with the size bit in SETATTR. Increasing the size of a file along with the size bit in SETATTR. Increasing the size of a file
with space_reserved set will fail if space reservation cannot be with space_reserved set will fail if space reservation cannot be
guaranteed for the new size. If the file size is decreased, space guaranteed for the new size. If the file size is decreased, space
reservation is only guaranteed for the new size and the extra blocks reservation is only guaranteed for the new size and the extra blocks
backing the file can be released. backing the file can be released.
10.1.2. Attribute 78: space_freed 11.1.2. Attribute 78: space_freed
space_freed gives the number of bytes freed if the file is deleted. space_freed gives the number of bytes freed if the file is deleted.
This attribute is read only and is of type length4. It is a per file This attribute is read only and is of type length4. It is a per file
attribute. attribute.
11. Operations: REQUIRED, RECOMMENDED, or OPTIONAL 12. Operations: REQUIRED, RECOMMENDED, or OPTIONAL
The following tables summarize the operations of the NFSv4.2 protocol The following tables summarize the operations of the NFSv4.2 protocol
and the corresponding designation of REQUIRED, RECOMMENDED, and and the corresponding designation of REQUIRED, RECOMMENDED, and
OPTIONAL to implement or MUST NOT implement. The designation of MUST OPTIONAL to implement or MUST NOT implement. The designation of MUST
NOT implement is reserved for those operations that were defined in NOT implement is reserved for those operations that were defined in
either NFSv4.0 or NFSv4.1 and MUST NOT be implemented in NFSv4.2. either NFSv4.0 or NFSv4.1 and MUST NOT be implemented in NFSv4.2.
For the most part, the REQUIRED, RECOMMENDED, or OPTIONAL designation For the most part, the REQUIRED, RECOMMENDED, or OPTIONAL designation
for operations sent by the client is for the server implementation. for operations sent by the client is for the server implementation.
The client is generally required to implement the operations needed The client is generally required to implement the operations needed
skipping to change at page 50, line 45 skipping to change at page 52, line 29
| CB_RECALL_ANY | OPT | FDELG, DDELG, pNFS | | CB_RECALL_ANY | OPT | FDELG, DDELG, pNFS |
| | | (REQ) | | | | (REQ) |
| CB_RECALL_SLOT | REQ | | | CB_RECALL_SLOT | REQ | |
| CB_RECALLABLE_OBJ_AVAIL | OPT | DDELG, pNFS (REQ) | | CB_RECALLABLE_OBJ_AVAIL | OPT | DDELG, pNFS (REQ) |
| CB_SEQUENCE | OPT | FDELG, DDELG, pNFS | | CB_SEQUENCE | OPT | FDELG, DDELG, pNFS |
| | | (REQ) | | | | (REQ) |
| CB_WANTS_CANCELLED | OPT | FDELG, DDELG, pNFS | | CB_WANTS_CANCELLED | OPT | FDELG, DDELG, pNFS |
| | | (REQ) | | | | (REQ) |
+-------------------------+-------------------+---------------------+ +-------------------------+-------------------+---------------------+
12. NFSv4.2 Operations 13. NFSv4.2 Operations
12.1. Operation 59: COPY - Initiate a server-side copy 13.1. Operation 59: COPY - Initiate a server-side copy
12.1.1. ARGUMENT
13.1.1. ARGUMENT
const COPY4_GUARDED = 0x00000001; const COPY4_GUARDED = 0x00000001;
const COPY4_METADATA = 0x00000002; const COPY4_METADATA = 0x00000002;
struct COPY4args { struct COPY4args {
/* SAVED_FH: source file */ /* SAVED_FH: source file */
/* CURRENT_FH: destination file or */ /* CURRENT_FH: destination file or */
/* directory */ /* directory */
offset4 ca_src_offset; offset4 ca_src_offset;
offset4 ca_dst_offset; offset4 ca_dst_offset;
length4 ca_count; length4 ca_count;
uint32_t ca_flags; uint32_t ca_flags;
component4 ca_destination; component4 ca_destination;
netloc4 ca_source_server<>; netloc4 ca_source_server<>;
}; };
12.1.2. RESULT 13.1.2. RESULT
union COPY4res switch (nfsstat4 cr_status) { union COPY4res switch (nfsstat4 cr_status) {
case NFS4_OK: case NFS4_OK:
stateid4 cr_callback_id<1>; stateid4 cr_callback_id<1>;
default: default:
length4 cr_bytes_copied; length4 cr_bytes_copied;
}; };
12.1.3. DESCRIPTION 13.1.3. DESCRIPTION
The COPY operation is used for both intra-server and inter-server The COPY operation is used for both intra-server and inter-server
copies. In both cases, the COPY is always sent from the client to copies. In both cases, the COPY is always sent from the client to
the destination server of the file copy. The COPY operation requests the destination server of the file copy. The COPY operation requests
that a file be copied from the location specified by the SAVED_FH that a file be copied from the location specified by the SAVED_FH
value to the location specified by the combination of CURRENT_FH and value to the location specified by the combination of CURRENT_FH and
ca_destination. ca_destination.
The SAVED_FH must be a regular file. If SAVED_FH is not a regular The SAVED_FH must be a regular file. If SAVED_FH is not a regular
file, the operation MUST fail and return NFS4ERR_WRONG_TYPE. file, the operation MUST fail and return NFS4ERR_WRONG_TYPE.
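The file handle checks on COPY can be sketched in Python as shown below. The sketch covers only the regular file requirement stated above and the NFS4ERR_NOTDIR and NFS4ERR_ISDIR conditions given in the error descriptions for this operation. The CopyArgs class, the string file types, and the function name are hypothetical, status codes are shown by name rather than by number, and ca_source_server is omitted for brevity.

```python
from dataclasses import dataclass

COPY4_GUARDED = 0x00000001
COPY4_METADATA = 0x00000002


@dataclass
class CopyArgs:
    # Mirrors COPY4args, without ca_source_server.
    ca_src_offset: int
    ca_dst_offset: int
    ca_count: int
    ca_flags: int
    ca_destination: str


def precheck_copy(saved_fh_type: str, current_fh_type: str,
                  args: CopyArgs) -> str:
    """Return the status name for the file handle checks only."""
    if saved_fh_type != "file":
        return "NFS4ERR_WRONG_TYPE"  # SAVED_FH must be a regular file
    if current_fh_type == "file" and args.ca_destination:
        return "NFS4ERR_NOTDIR"  # a name was given but CURRENT_FH is a file
    if current_fh_type == "dir" and not args.ca_destination:
        return "NFS4ERR_ISDIR"  # CURRENT_FH is a directory and no name given
    return "NFS4_OK"
```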
skipping to change at page 54, line 5 skipping to change at page 55, line 35
server, the behavior is implementation dependent. server, the behavior is implementation dependent.
If the metadata flag is set and the client is requesting a whole file If the metadata flag is set and the client is requesting a whole file
copy (i.e., ca_count is 0 (zero)), a subset of the destination file's copy (i.e., ca_count is 0 (zero)), a subset of the destination file's
attributes MUST be the same as the source file's corresponding attributes MUST be the same as the source file's corresponding
attributes and a subset of the destination file's attributes SHOULD attributes and a subset of the destination file's attributes SHOULD
be the same as the source file's corresponding attributes. The be the same as the source file's corresponding attributes. The
attributes in the MUST and SHOULD copy subsets will be defined for attributes in the MUST and SHOULD copy subsets will be defined for
each NFS version. each NFS version.
For NFSv4.1, Table 1 and Table 2 list the REQUIRED and RECOMMENDED For NFSv4.1, Table 2 and Table 3 list the REQUIRED and RECOMMENDED
attributes respectively. A "MUST" in the "Copy to destination file?" attributes respectively. A "MUST" in the "Copy to destination file?"
column indicates that the attribute is part of the MUST copy set. A column indicates that the attribute is part of the MUST copy set. A
"SHOULD" in the "Copy to destination file?" column indicates that the "SHOULD" in the "Copy to destination file?" column indicates that the
attribute is part of the SHOULD copy set. attribute is part of the SHOULD copy set.
+--------------------+----+---------------------------+ +--------------------+----+---------------------------+
| Name | Id | Copy to destination file? | | Name | Id | Copy to destination file? |
+--------------------+----+---------------------------+ +--------------------+----+---------------------------+
| supported_attrs | 0 | no | | supported_attrs | 0 | no |
| type | 1 | MUST | | type | 1 | MUST |
skipping to change at page 54, line 30 skipping to change at page 56, line 24
| symlink_support | 6 | no | | symlink_support | 6 | no |
| named_attr | 7 | no | | named_attr | 7 | no |
| fsid | 8 | no | | fsid | 8 | no |
| unique_handles | 9 | no | | unique_handles | 9 | no |
| lease_time | 10 | no | | lease_time | 10 | no |
| rdattr_error | 11 | no | | rdattr_error | 11 | no |
| filehandle | 19 | no | | filehandle | 19 | no |
| suppattr_exclcreat | 75 | no | | suppattr_exclcreat | 75 | no |
+--------------------+----+---------------------------+ +--------------------+----+---------------------------+
Table 1 Table 2
+--------------------+----+---------------------------+ +--------------------+----+---------------------------+
| Name | Id | Copy to destination file? | | Name | Id | Copy to destination file? |
+--------------------+----+---------------------------+ +--------------------+----+---------------------------+
| acl | 12 | MUST | | acl | 12 | MUST |
| aclsupport | 13 | no | | aclsupport | 13 | no |
| archive | 14 | no | | archive | 14 | no |
| cansettime | 15 | no | | cansettime | 15 | no |
| case_insensitive | 16 | no | | case_insensitive | 16 | no |
| case_preserving | 17 | no | | case_preserving | 17 | no |
skipping to change at page 56, line 4 skipping to change at page 57, line 46
| system | 46 | MUST | | system | 46 | MUST |
| time_access | 47 | MUST | | time_access | 47 | MUST |
| time_access_set | 48 | no | | time_access_set | 48 | no |
| time_backup | 49 | no | | time_backup | 49 | no |
| time_create | 50 | MUST | | time_create | 50 | MUST |
| time_delta | 51 | no | | time_delta | 51 | no |
| time_metadata | 52 | SHOULD | | time_metadata | 52 | SHOULD |
| time_modify | 53 | MUST | | time_modify | 53 | MUST |
| time_modify_set | 54 | no | | time_modify_set | 54 | no |
+--------------------+----+---------------------------+ +--------------------+----+---------------------------+
Table 2
Table 3
[NOTE: The source file's attribute values will take precedence over [NOTE: The source file's attribute values will take precedence over
any attribute values inherited by the destination file.] any attribute values inherited by the destination file.]
In the case of an inter-server copy or an intra-server copy between In the case of an inter-server copy or an intra-server copy between
file systems, the attributes supported for the source file and file systems, the attributes supported for the source file and
destination file could be different. By definition, the REQUIRED destination file could be different. By definition, the REQUIRED
attributes will be supported in all cases. If the metadata flag is attributes will be supported in all cases. If the metadata flag is
set and the source file has a RECOMMENDED attribute that is not set and the source file has a RECOMMENDED attribute that is not
supported for the destination file, the copy MUST fail with supported for the destination file, the copy MUST fail with
NFS4ERR_ATTRNOTSUPP. NFS4ERR_ATTRNOTSUPP.
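The rule in the preceding paragraph amounts to a subset test, as in this Python sketch. Representing attributes as sets of names and returning the status as a string are assumptions made for illustration, and the function name is hypothetical.

```python
from typing import Set

COPY4_METADATA = 0x00000002


def check_metadata_copy(ca_flags: int,
                        source_recommended: Set[str],
                        dest_supported: Set[str]) -> str:
    """Fail the copy if the metadata flag is set and the destination
    cannot hold every RECOMMENDED attribute present on the source."""
    metadata_requested = bool(ca_flags & COPY4_METADATA)
    if metadata_requested and not source_recommended <= dest_supported:
        return "NFS4ERR_ATTRNOTSUPP"
    return "NFS4_OK"
```

When the metadata flag is clear, the sketch does not compare the attribute sets at all.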
Any attribute supported by the destination server that is not set on Any attribute supported by the destination server that is not set on
the source file SHOULD be left unset. the source file SHOULD be left unset.
skipping to change at page 57, line 26 skipping to change at page 59, line 21
destination file MUST appear identical to the NFS client. However, destination file MUST appear identical to the NFS client. However,
the NFS server's on disk representation of the data in the source the NFS server's on disk representation of the data in the source
file and destination file MAY differ. For example, the NFS server file and destination file MAY differ. For example, the NFS server
might encrypt, compress, deduplicate, or otherwise represent the on might encrypt, compress, deduplicate, or otherwise represent the on
disk data in the source and destination file differently. disk data in the source and destination file differently.
In the event of a failure the state of the destination file is In the event of a failure the state of the destination file is
implementation dependent. The COPY operation may fail for the implementation dependent. The COPY operation may fail for the
following reasons (this is a partial list). following reasons (this is a partial list).
NFS4ERR_MOVED: The file system which contains the source file, or o NFS4ERR_MOVED
the destination file or directory is not present. The client can
determine the correct location and reissue the operation with the
correct location.
NFS4ERR_NOTSUPP: The copy offload operation is not supported by the
NFS server receiving this request.
NFS4ERR_PARTNER_NOTSUPP: The remote server does not support the o NFS4ERR_NOTSUPP
server-to-server copy offload protocol.
NFS4ERR_OFFLOAD_DENIED: The copy offload operation is supported by o NFS4ERR_PARTNER_NOTSUPP
both the source and the destination, but the destination is not
allowing it for this file. If the client sees this error, it
should fall back to the normal copy semantics.
NFS4ERR_PARTNER_NO_AUTH: The remote server does not authorize a o NFS4ERR_OFFLOAD_DENIED
server-to-server copy offload operation. This may be due to the
client's failure to send the COPY_NOTIFY operation to the remote
server, the remote server receiving a server-to-server copy
offload request after the copy lease time expired, or for some
other permission problem.
NFS4ERR_FBIG: The copy operation would have caused the file to grow o NFS4ERR_PARTNER_NO_AUTH
beyond the server's limit.
NFS4ERR_NOTDIR: The CURRENT_FH is a file and ca_destination has non- o NFS4ERR_FBIG
zero length.
NFS4ERR_WRONG_TYPE: The SAVED_FH is not a regular file. o NFS4ERR_NOTDIR
NFS4ERR_ISDIR: The CURRENT_FH is a directory and ca_destination has o NFS4ERR_WRONG_TYPE
zero length.
NFS4ERR_INVAL: The source offset or offset plus count are greater o NFS4ERR_ISDIR
than or equal to the size of the source file.
NFS4ERR_DELAY: The server does not have the resources to perform the o NFS4ERR_INVAL
copy operation at the current time. The client should retry the
operation sometime in the future.
NFS4ERR_METADATA_NOTSUPP: The destination file cannot support the o NFS4ERR_DELAY
same metadata as the source file.
NFS4ERR_WRONGSEC: The security mechanism being used by the client o NFS4ERR_METADATA_NOTSUPP
does not match the server's security policy.
12.2. Operation 60: COPY_ABORT - Cancel a server-side copy o NFS4ERR_WRONGSEC
12.2.1. ARGUMENT 13.2. Operation 60: COPY_ABORT - Cancel a server-side copy
13.2.1. ARGUMENT
struct COPY_ABORT4args { struct COPY_ABORT4args {
/* CURRENT_FH: destination file */ /* CURRENT_FH: destination file */
stateid4 caa_stateid; stateid4 caa_stateid;
}; };
12.2.2. RESULT 13.2.2. RESULT
struct COPY_ABORT4res { struct COPY_ABORT4res {
nfsstat4 car_status; nfsstat4 car_status;
}; };
12.2.3. DESCRIPTION 13.2.3. DESCRIPTION
COPY_ABORT is used for both intra- and inter-server asynchronous COPY_ABORT is used for both intra- and inter-server asynchronous
copies. The COPY_ABORT operation allows the client to cancel a copies. The COPY_ABORT operation allows the client to cancel a
server-side copy operation that it initiated. This operation is sent server-side copy operation that it initiated. This operation is sent
in a COMPOUND request from the client to the destination server. in a COMPOUND request from the client to the destination server.
This operation may be used to cancel a copy when the application that This operation may be used to cancel a copy when the application that
requested the copy exits before the operation is completed or for requested the copy exits before the operation is completed or for
some other reason. some other reason.
The request contains the filehandle and copy stateid cookies that act The request contains the filehandle and copy stateid cookies that act
skipping to change at page 59, line 21 skipping to change at page 60, line 42
operation was canceled and no callback will be issued by the server. operation was canceled and no callback will be issued by the server.
A copy operation that is successfully canceled may result in none, A copy operation that is successfully canceled may result in none,
some, or all of the data copied. some, or all of the data copied.
If the server supports asynchronous copies, the server is REQUIRED to If the server supports asynchronous copies, the server is REQUIRED to
support the COPY_ABORT operation. support the COPY_ABORT operation.
The COPY_ABORT operation may fail for the following reasons (this is The COPY_ABORT operation may fail for the following reasons (this is
a partial list): a partial list):
NFS4ERR_NOTSUPP: The abort operation is not supported by the NFS o NFS4ERR_NOTSUPP
server receiving this request.
NFS4ERR_RETRY: The abort failed, but a retry at some time in the o NFS4ERR_RETRY
future MAY succeed.
NFS4ERR_COMPLETE_ALREADY: The abort failed, and a callback will o NFS4ERR_COMPLETE_ALREADY
deliver the results of the copy operation.
NFS4ERR_SERVERFAULT: An error occurred on the server that does not o NFS4ERR_SERVERFAULT
map to a specific error code.
12.3. Operation 61: COPY_NOTIFY - Notify a source server of a future 13.3. Operation 61: COPY_NOTIFY - Notify a source server of a future
copy copy
12.3.1. ARGUMENT 13.3.1. ARGUMENT
struct COPY_NOTIFY4args { struct COPY_NOTIFY4args {
/* CURRENT_FH: source file */ /* CURRENT_FH: source file */
netloc4 cna_destination_server; netloc4 cna_destination_server;
}; };
12.3.2. RESULT 13.3.2. RESULT
struct COPY_NOTIFY4resok { struct COPY_NOTIFY4resok {
nfstime4 cnr_lease_time; nfstime4 cnr_lease_time;
netloc4 cnr_source_server<>; netloc4 cnr_source_server<>;
}; };
union COPY_NOTIFY4res switch (nfsstat4 cnr_status) { union COPY_NOTIFY4res switch (nfsstat4 cnr_status) {
case NFS4_OK: case NFS4_OK:
COPY_NOTIFY4resok resok4; COPY_NOTIFY4resok resok4;
default: default:
void; void;
}; };
12.3.3. DESCRIPTION 13.3.3. DESCRIPTION
This operation is used for an inter-server copy. A client sends this This operation is used for an inter-server copy. A client sends this
operation in a COMPOUND request to the source server to authorize a operation in a COMPOUND request to the source server to authorize a
destination server identified by cna_destination_server to read the destination server identified by cna_destination_server to read the
file specified by CURRENT_FH on behalf of the given user. file specified by CURRENT_FH on behalf of the given user.
The cna_destination_server MUST be specified using the netloc4 The cna_destination_server MUST be specified using the netloc4
network location format. The server is not required to resolve the network location format. The server is not required to resolve the
cna_destination_server address before completing this operation. cna_destination_server address before completing this operation.
skipping to change at page 61, line 46 skipping to change at page 63, line 9
If the client wishes to perform an inter-server copy, the client MUST If the client wishes to perform an inter-server copy, the client MUST
send a COPY_NOTIFY to the source server. Therefore, the source send a COPY_NOTIFY to the source server. Therefore, the source
server MUST support COPY_NOTIFY. server MUST support COPY_NOTIFY.
For a copy only involving one server (the source and destination are For a copy only involving one server (the source and destination are
on the same server), this operation is unnecessary. on the same server), this operation is unnecessary.
The COPY_NOTIFY operation may fail for the following reasons (this is The COPY_NOTIFY operation may fail for the following reasons (this is
a partial list): a partial list):
NFS4ERR_MOVED: The file system which contains the source file is not o NFS4ERR_MOVED
present on the source server. The client can determine the
correct location and reissue the operation with the correct
location.
NFS4ERR_NOTSUPP: The copy offload operation is not supported by the o NFS4ERR_NOTSUPP
NFS server receiving this request.
NFS4ERR_WRONGSEC: The security mechanism being used by the client o NFS4ERR_WRONGSEC
does not match the server's security policy.
12.4. Operation 62: COPY_REVOKE - Revoke a destination server's copy 13.4. Operation 62: COPY_REVOKE - Revoke a destination server's copy
privileges privileges
12.4.1. ARGUMENT 13.4.1. ARGUMENT
struct COPY_REVOKE4args { struct COPY_REVOKE4args {
/* CURRENT_FH: source file */ /* CURRENT_FH: source file */
netloc4 cra_destination_server; netloc4 cra_destination_server;
}; };
12.4.2. RESULT 13.4.2. RESULT
struct COPY_REVOKE4res { struct COPY_REVOKE4res {
nfsstat4 crr_status; nfsstat4 crr_status;
}; };
12.4.3. DESCRIPTION 13.4.3. DESCRIPTION
This operation is used for an inter-server copy. A client sends this This operation is used for an inter-server copy. A client sends this
operation in a COMPOUND request to the source server to revoke the operation in a COMPOUND request to the source server to revoke the
authorization of a destination server identified by authorization of a destination server identified by
cra_destination_server from reading the file specified by CURRENT_FH cra_destination_server from reading the file specified by CURRENT_FH
on behalf of the given user. If the cra_destination_server has already on behalf of the given user. If the cra_destination_server has already
begun copying the file, a successful return from this operation begun copying the file, a successful return from this operation
indicates that further access will be prevented. indicates that further access will be prevented.
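One way to picture the relationship between COPY_NOTIFY and COPY_REVOKE is as an authorization table kept by the source server. The sketch below is hypothetical: the class and method names are invented for illustration, and lease expiry (cnr_lease_time) is deliberately left out.

```python
class CopyAuthTable:
    """Source-server view of which destinations may read which files."""

    def __init__(self):
        # Each entry is (filehandle, user, destination_server).
        self._authorized = set()

    def copy_notify(self, fh, user, destination):
        """Record that `destination` may read `fh` on behalf of `user`."""
        self._authorized.add((fh, user, destination))
        return "NFS4_OK"

    def copy_revoke(self, fh, user, destination):
        """Withdraw the authorization; later reads are refused."""
        self._authorized.discard((fh, user, destination))
        return "NFS4_OK"

    def may_read(self, fh, user, destination):
        """Checked by the source server on each read from a destination."""
        return (fh, user, destination) in self._authorized
```

Because may_read is consulted on every read in this sketch, a revoke that arrives after copying has begun still prevents further access.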
The cra_destination_server MUST be specified using the netloc4 The cra_destination_server MUST be specified using the netloc4
skipping to change at page 63, line 8 skipping to change at page 64, line 14
For a copy only involving one server (the source and destination are For a copy only involving one server (the source and destination are
on the same server), this operation is unnecessary. on the same server), this operation is unnecessary.
If the server supports COPY_NOTIFY, the server is REQUIRED to support If the server supports COPY_NOTIFY, the server is REQUIRED to support
the COPY_REVOKE operation. the COPY_REVOKE operation.
The COPY_REVOKE operation may fail for the following reasons (this is The COPY_REVOKE operation may fail for the following reasons (this is
a partial list): a partial list):
NFS4ERR_MOVED: The file system which contains the source file is not o NFS4ERR_MOVED
present on the source server. The client can determine the
correct location and reissue the operation with the correct
location.
NFS4ERR_NOTSUPP: The copy offload operation is not supported by the o NFS4ERR_NOTSUPP
NFS server receiving this request.
12.5. Operation 63: COPY_STATUS - Poll for status of a server-side copy 13.5. Operation 63: COPY_STATUS - Poll for status of a server-side copy
12.5.1. ARGUMENT 13.5.1. ARGUMENT
struct COPY_STATUS4args { struct COPY_STATUS4args {
/* CURRENT_FH: destination file */ /* CURRENT_FH: destination file */
stateid4 csa_stateid; stateid4 csa_stateid;
}; };
12.5.2. RESULT 13.5.2. RESULT
struct COPY_STATUS4resok { struct COPY_STATUS4resok {
length4 csr_bytes_copied; length4 csr_bytes_copied;
nfsstat4 csr_complete<1>; nfsstat4 csr_complete<1>;
}; };
union COPY_STATUS4res switch (nfsstat4 csr_status) { union COPY_STATUS4res switch (nfsstat4 csr_status) {
case NFS4_OK: case NFS4_OK:
COPY_STATUS4resok resok4; COPY_STATUS4resok resok4;
default: default:
void; void;
}; };
12.5.3. DESCRIPTION 13.5.3. DESCRIPTION
COPY_STATUS is used for both intra- and inter-server asynchronous COPY_STATUS is used for both intra- and inter-server asynchronous
copies. The COPY_STATUS operation allows the client to poll the copies. The COPY_STATUS operation allows the client to poll the
server to determine the status of an asynchronous copy operation. server to determine the status of an asynchronous copy operation.
This operation is sent by the client to the destination server. This operation is sent by the client to the destination server.
If this operation is successful, the number of bytes copied is If this operation is successful, the number of bytes copied is
returned to the client in the csr_bytes_copied field. The returned to the client in the csr_bytes_copied field. The
csr_bytes_copied value indicates the number of bytes copied but not csr_bytes_copied value indicates the number of bytes copied but not
which specific bytes have been copied. which specific bytes have been copied.
skipping to change at page 64, line 20 skipping to change at page 65, line 22
The failure of this operation does not indicate the result of the The failure of this operation does not indicate the result of the
asynchronous copy in any way. asynchronous copy in any way.
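A client might interpret a COPY_STATUS reply along the following lines. This is a sketch under assumptions: status codes are shown as strings, the function name is hypothetical, and an empty csr_complete array is taken to mean that the copy is still in progress, which the text visible here does not state explicitly.

```python
def interpret_copy_status(csr_status, csr_bytes_copied=0, csr_complete=()):
    """Map a COPY_STATUS4res onto a (state, bytes_copied) pair.

    csr_complete models the XDR field `nfsstat4 csr_complete<1>`:
    a sequence holding zero or one status values.
    """
    if csr_status != "NFS4_OK":
        # A failed COPY_STATUS says nothing about the copy itself.
        return ("unknown", None)
    if len(csr_complete) == 0:
        # Assumption: no completion status means the copy is ongoing.
        return ("in_progress", csr_bytes_copied)
    if csr_complete[0] == "NFS4_OK":
        return ("done", csr_bytes_copied)
    return ("failed", csr_bytes_copied)
```

The byte count reports only how much has been copied, not which bytes, so a client should not use it to infer which ranges of the destination are valid.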
If the server supports asynchronous copies, the server is REQUIRED to If the server supports asynchronous copies, the server is REQUIRED to
support the COPY_STATUS operation. support the COPY_STATUS operation.
The COPY_STATUS operation may fail for the following reasons (this is The COPY_STATUS operation may fail for the following reasons (this is
a partial list): a partial list):
NFS4ERR_NOTSUPP: The copy status operation is not supported by the o NFS4ERR_NOTSUPP
NFS server receiving this request.
NFS4ERR_BAD_STATEID: The stateid is not valid (see Section 2.3.2 o NFS4ERR_BAD_STATEID
below).
NFS4ERR_EXPIRED: The stateid has expired (see Copy Offload Stateid o NFS4ERR_EXPIRED
section below).
12.6. Modification to Operation 42: EXCHANGE_ID - Instantiate Client ID 13.6. Modification to Operation 42: EXCHANGE_ID - Instantiate Client ID
12.6.1. ARGUMENT 13.6.1. ARGUMENT
/* new */ /* new */
const EXCHGID4_FLAG_SUPP_FENCE_OPS = 0x00000004; const EXCHGID4_FLAG_SUPP_FENCE_OPS = 0x00000004;
12.6.2. RESULT 13.6.2. RESULT
Unchanged Unchanged
12.6.3. MOTIVATION 13.6.3. MOTIVATION
Enterprise applications require guarantees that an operation has Enterprise applications require guarantees that an operation has
either aborted or completed. NFSv4.1 provides this guarantee as long either aborted or completed. NFSv4.1 provides this guarantee as long
as the session is alive: simply send a SEQUENCE operation on the same as the session is alive: simply send a SEQUENCE operation on the same
slot with a new sequence number, and the successful return of slot with a new sequence number, and the successful return of
SEQUENCE indicates the previous operation has completed. However, if SEQUENCE indicates the previous operation has completed. However, if
the session is lost, there is no way to know when any in progress the session is lost, there is no way to know when any in progress
operations have aborted or completed. In hindsight, the NFSv4.1 operations have aborted or completed. In hindsight, the NFSv4.1
specification should have mandated that DESTROY_SESSION abort/ specification should have mandated that DESTROY_SESSION abort/
complete all outstanding operations. complete all outstanding operations.
12.6.4. DESCRIPTION 13.6.4. DESCRIPTION
A client SHOULD request the EXCHGID4_FLAG_SUPP_FENCE_OPS capability A client SHOULD request the EXCHGID4_FLAG_SUPP_FENCE_OPS capability
when it sends an EXCHANGE_ID operation. The server SHOULD set this when it sends an EXCHANGE_ID operation. The server SHOULD set this
capability in the EXCHANGE_ID reply whether the client requests it or capability in the EXCHANGE_ID reply whether the client requests it or
not. If the client ID is created with this capability then the not. If the client ID is created with this capability then the
following will occur: following will occur:
o The server will not reply to DESTROY_SESSION until all operations o The server will not reply to DESTROY_SESSION until all operations
in progress are completed or aborted. in progress are completed or aborted.
skipping to change at page 65, line 34 skipping to change at page 66, line 34
sessions, opens, locks, delegations, layouts, and/or wants are sessions, opens, locks, delegations, layouts, and/or wants are
deleted. deleted.
o The NFS server SHOULD support client ID trunking, and if it does o The NFS server SHOULD support client ID trunking, and if it does
and the EXCHGID4_FLAG_SUPP_FENCE_OPS capability is enabled, then a and the EXCHGID4_FLAG_SUPP_FENCE_OPS capability is enabled, then a
session ID created on one node of the storage cluster MUST be session ID created on one node of the storage cluster MUST be
destroyable via DESTROY_SESSION. In addition, DESTROY_CLIENTID destroyable via DESTROY_SESSION. In addition, DESTROY_CLIENTID
and an EXCHANGE_ID with a new verifier affect all sessions and an EXCHANGE_ID with a new verifier affect all sessions
regardless of what node the sessions were created on. regardless of what node the sessions were created on.
12.7. Operation 64: INITIALIZE 13.7. Operation 64: INITIALIZE
This operation can be used to initialize the structure imposed by an This operation can be used to initialize the structure imposed by an
application onto a file and to punch a hole into a file. application onto a file, i.e., ADBs, and to punch a hole into a file.
The server has no concept of the structure imposed by the
application. It is only when the application writes to a section of
the file does order get imposed. In order to detect corruption even
before the application utilizes the file, the application will want
to initialize a range of ADBs. It uses the INITIALIZE operation to
do so.
12.7.1. ARGUMENT 13.7.1. ARGUMENT
/* /*
* We use data_content4 in case we wish to * We use data_content4 in case we wish to
* extend new types later. Note that we * extend new types later. Note that we
* are explicitly disallowing data. * are explicitly disallowing data.
*/ */
union initialize_arg4 switch (data_content4 content) { union initialize_arg4 switch (data_content4 content) {
case NFS4_CONTENT_APP_BLOCK: case NFS4_CONTENT_APP_BLOCK:
app_data_block4 ia_adb; app_data_block4 ia_adb;
case NFS4_CONTENT_HOLE: case NFS4_CONTENT_HOLE:
skipping to change at page 66, line 28 skipping to change at page 67, line 28
void; void;
}; };
struct INITIALIZE4args { struct INITIALIZE4args {
/* CURRENT_FH: file */ /* CURRENT_FH: file */
stateid4 ia_stateid; stateid4 ia_stateid;
stable_how4 ia_stable; stable_how4 ia_stable;
initialize_arg4 ia_data<>; initialize_arg4 ia_data<>;
}; };
12.7.2. RESULT 13.7.2. RESULT
struct INITIALIZE4resok { struct INITIALIZE4resok {
count4 ir_count; count4 ir_count;
stable_how4 ir_committed; stable_how4 ir_committed;
verifier4 ir_writeverf; verifier4 ir_writeverf;
data_content4 ir_sparse; data_content4 ir_sparse;
}; };
union INITIALIZE4res switch (nfsstat4 status) { union INITIALIZE4res switch (nfsstat4 status) {
case NFS4_OK: case NFS4_OK:
INITIALIZE4resok resok4; INITIALIZE4resok resok4;
default: default:
void; void;
}; };
12.7.3. DESCRIPTION 13.7.3. DESCRIPTION
13.7.3.1. Hole punching
When the client invokes the INITIALIZE operation, it has two desired
results:
1. The structure described by the app_data_block4 be imposed on the
file.
2. The contents described by the app_data_block4 be sparse.
If the server supports the INITIALIZE operation, it still might not
support sparse files. So if it receives the INITIALIZE operation,
then it MUST populate the contents of the file with the initialized
ADBs. In other words, if the server supports INITIALIZE, then it
supports the concept of ADBs. [[Comment.7: Do we want to support an
asynchronous INITIALIZE? Do we have to? --TH]] [[Comment.8: Need to
document union arm error code. --TH]]
If the data was already initialized, There are two interesting
scenarios:
1. The data blocks are allocated.
2. Initializing in the middle of an existing ADB.
If the data blocks were already allocated, then the INITIALIZE is a
hole punch operation. If INITIALIZE supports sparse files, then the
data blocks are to be deallocated. If not, then the data blocks are
to be rewritten in the indicated ADB format. [[Comment.9: Need to
document interaction between space reservation and hole punching?
--TH]]
Since the server has no knowledge of ADBs, it should not report
misaligned creation of ADBs. Even while it can detect them, it
cannot disallow them, as the application might be in the process of
changing the size of the ADBs. Thus the server must be prepared to
handle an INITIALIZE into an existing ADB.
This document does not mandate the manner in which the server stores
ADBs sparsely for a file. It does assume that if ADBs are stored
sparsely, then the server can detect when an INITIALIZE arrives that
will force a new ADB to start inside an existing ADB. For example,
assume that ADBi has an adb_block_size of 4k and that an INITIALIZE
starts 1k inside ADBi. The server should [[Comment.10: Need to flesh
this out. --TH]]
12.7.3.1. Hole punching
Whenever a client wishes to deallocate the blocks backing a Whenever a client wishes to zero the blocks backing a particular
particular region in the file, it calls the INITIALIZE operation with region in the file, it calls the INITIALIZE operation with the
the current filehandle set to the filehandle of the file in question, current filehandle set to the filehandle of the file in question, and
start offset and length in bytes of the region set in hpa_offset and the equivalent of start offset and length in bytes of the region set
hpa_count respectively. All further reads to this region MUST return in ia_hole.di_offset and ia_hole.di_length respectively. If the
zeros until overwritten. The filehandle specified must be that of a ia_hole.di_allocated is set to TRUE, then the blocks will be zeroed
regular file. and if it is set to FALSE, then they will be deallocated. All
further reads to this region MUST return zeros until overwritten.
The filehandle specified must be that of a regular file.
Situations may arise where ia_hole.hi_offset and/or ia_hole.hi_offset Situations may arise where di_offset and/or di_offset + di_length
+ ia_hole.hi_length will not be aligned to a boundary that the server will not be aligned to a boundary that the server does allocations/
does allocations/ deallocations in. For most filesystems, this is deallocations in. For most filesystems, this is the block size of
the block size of the file system. In such a case, the server can the file system. In such a case, the server can deallocate as many
deallocate as many bytes as it can in the region. The blocks that bytes as it can in the region. The blocks that cannot be deallocated
cannot be deallocated MUST be zeroed. Except for the block MUST be zeroed. Except for the block deallocation and maximum hole
deallocation and maximum hole punching capability, an INITIALIZE punching capability, an INITIALIZE operation is to be treated similar
operation is to be treated similar to a write of zeroes. to a write of zeroes.
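The alignment rule above can be illustrated with a short calculation. The sketch assumes a single fixed allocation unit, called block_size here, and a hypothetical helper name; real servers may use other allocation granularities.

```python
def split_hole(di_offset, di_length, block_size):
    """Split a hole-punch request into ranges to zero and to deallocate.

    Returns (zero_ranges, dealloc_range) where ranges are half-open
    (start, end) byte intervals. Only whole blocks fully inside the
    request can be deallocated; partial blocks at either end are zeroed.
    """
    end = di_offset + di_length
    # Round the start up and the end down to block boundaries.
    first_aligned = -(-di_offset // block_size) * block_size
    last_aligned = (end // block_size) * block_size
    if first_aligned >= last_aligned:
        # No whole block is covered, so the entire range is zeroed.
        return ([(di_offset, end)] if di_length else []), None
    zero_ranges = []
    if di_offset < first_aligned:
        zero_ranges.append((di_offset, first_aligned))
    if last_aligned < end:
        zero_ranges.append((last_aligned, end))
    return zero_ranges, (first_aligned, last_aligned)
```

For example, with a 4096-byte block size, a request covering bytes 1024 through 9215 deallocates the block at 4096..8191 and zeroes the partial blocks on either side.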
The server is not required to complete deallocating the blocks The server is not required to complete deallocating the blocks
specified in the operation before returning. It is acceptable to specified in the operation before returning. It is acceptable to
have the deallocation be deferred. In fact, INITIALIZE is merely a have the deallocation be deferred. In fact, INITIALIZE is merely a
hint; it is valid for a server to return success without ever doing hint; it is valid for a server to return success without ever doing
anything towards deallocating the blocks backing the region anything towards deallocating the blocks backing the region
specified. However, any future reads to the region MUST return specified. However, any future reads to the region MUST return
zeroes. zeroes.
If used to hole punch, INITIALIZE will result in the space_used If used to hole punch, INITIALIZE will result in the space_used
attribute being decreased by the number of bytes that were attribute being decreased by the number of bytes that were
deallocated. The space_freed attribute may or may not decrease, deallocated. The space_freed attribute may or may not decrease,
depending on the support and whether the blocks backing the specified depending on the support and whether the blocks backing the specified
range were shared or not. The size attribute will remain unchanged. range were shared or not. The size attribute will remain unchanged.
The INITIALIZE operation MUST NOT change the space reservation The INITIALIZE operation MUST NOT change the space reservation
guarantee of the file. While the server can deallocate the blocks guarantee of the file. While the server can deallocate the blocks
specified by hpa_offset and hpa_count, future writes to this region specified by di_offset and di_length, future writes to this region
MUST NOT fail with NFS4ERR_NOSPC. MUST NOT fail with NFS4ERR_NOSPC.
The INITIALIZE operation may fail for the following reasons (this is The INITIALIZE operation may fail for the following reasons (this is
a partial list): a partial list):
NFS4ERR_NOTSUPP The Hole punch operations are not supported by the NFS4ERR_NOTSUPP The Hole punch operations are not supported by the
NFS server receiving this request. NFS server receiving this request.
NFS4ERR_DIR The current filehandle is of type NF4DIR. NFS4ERR_DIR The current filehandle is of type NF4DIR.
NFS4ERR_SYMLINK The current filehandle is of type NF4LNK. NFS4ERR_SYMLINK The current filehandle is of type NF4LNK.
NFS4ERR_WRONG_TYPE The current filehandle does not designate an NFS4ERR_WRONG_TYPE The current filehandle does not designate an
ordinary file. ordinary file.
12.8. Operation 67: IO_ADVISE - Application I/O access pattern hints 13.7.3.2. ADBs
If the server supports ADBs, then it MUST support the
NFS4_CONTENT_APP_BLOCK arm of the INITIALIZE operation. The server
has no concept of the structure imposed by the application. It is
only when the application writes to a section of the file does order
get imposed. In order to detect corruption even before the
application utilizes the file, the application will want to
initialize a range of ADBs using INITIALIZE.
For ADBs, when the client invokes the INITIALIZE operation, it has
two desired results:
1. The structure described by the app_data_block4 be imposed on the
file.
2. The contents described by the app_data_block4 be sparse.
If the server supports the INITIALIZE operation, it still might not
support sparse files. So if it receives the INITIALIZE operation,
then it MUST populate the contents of the file with the initialized
ADBs.
If the data was already initialized, there are two interesting
scenarios:
1. The data blocks are allocated.
2. Initializing in the middle of an existing ADB.
If the data blocks were already allocated, then the INITIALIZE is a
hole punch operation. If INITIALIZE supports sparse files, then the
data blocks are to be deallocated. If not, then the data blocks are
to be rewritten in the indicated ADB format.
Since the server has no knowledge of ADBs, it should not report
misaligned creation of ADBs. Even while it can detect them, it
cannot disallow them, as the application might be in the process of
changing the size of the ADBs. Thus the server must be prepared to
handle an INITIALIZE into an existing ADB.
This document does not mandate the manner in which the server stores
ADBs sparsely for a file. It does assume that if ADBs are stored
sparsely, then the server can detect when an INITIALIZE arrives that
will force a new ADB to start inside an existing ADB. For example,
assume that ADBi has an adb_block_size of 4k and that an INITIALIZE
starts 1k inside ADBi. The server should [[Comment.5: Need to flesh
this out. --TH]]
13.8. Operation 67: IO_ADVISE - Application I/O access pattern hints
This section introduces a new operation, named IO_ADVISE, which This section introduces a new operation, named IO_ADVISE, which
allows NFS clients to communicate application I/O access pattern allows NFS clients to communicate application I/O access pattern
hints to the NFS server. This new operation will allow hints to be hints to the NFS server. This new operation will allow hints to be
sent to the server when applications use posix_fadvise, direct I/O, sent to the server when applications use posix_fadvise, direct I/O,
or at any other point that the client finds useful. or at any other point that the client finds useful.
12.8.1. ARGUMENT 13.8.1. ARGUMENT
enum IO_ADVISE_type4 { enum IO_ADVISE_type4 {
IO_ADVISE4_NORMAL = 0, IO_ADVISE4_NORMAL = 0,
IO_ADVISE4_SEQUENTIAL = 1, IO_ADVISE4_SEQUENTIAL = 1,
IO_ADVISE4_SEQUENTIAL_BACKWARDS = 2, IO_ADVISE4_SEQUENTIAL_BACKWARDS = 2,
IO_ADVISE4_RANDOM = 3, IO_ADVISE4_RANDOM = 3,
IO_ADVISE4_WILLNEED = 4, IO_ADVISE4_WILLNEED = 4,
IO_ADVISE4_WILLNEED_OPPORTUNISTIC = 5, IO_ADVISE4_WILLNEED_OPPORTUNISTIC = 5,
IO_ADVISE4_DONTNEED = 6, IO_ADVISE4_DONTNEED = 6,
IO_ADVISE4_NOREUSE = 7, IO_ADVISE4_NOREUSE = 7,
IO_ADVISE4_READ = 8, IO_ADVISE4_READ = 8,
IO_ADVISE4_WRITE = 9 IO_ADVISE4_WRITE = 9,
IO_ADVISE4_INIT_PROXIMITY = 10
}; };
struct IO_ADVISE4args { struct IO_ADVISE4args {
/* CURRENT_FH: file */ /* CURRENT_FH: file */
stateid4 iar_stateid; stateid4 iar_stateid;
offset4 iar_offset; offset4 iar_offset;
length4 iar_count; length4 iar_count;
bitmap4 iar_hints; bitmap4 iar_hints;
}; };
12.8.2. RESULT 13.8.2. RESULT
struct IO_ADVISE4resok { struct IO_ADVISE4resok {
bitmap4 ior_hints; bitmap4 ior_hints;
}; };
union IO_ADVISE4res switch (nfsstat4 _status) { union IO_ADVISE4res switch (nfsstat4 _status) {
case NFS4_OK: case NFS4_OK:
IO_ADVISE4resok resok4; IO_ADVISE4resok resok4;
default: default:
void; void;
}; };
12.8.3. DESCRIPTION 13.8.3. DESCRIPTION
The IO_ADVISE operation sends an I/O access pattern hint to the The IO_ADVISE operation sends an I/O access pattern hint to the
server for the owner of the stateid for a given byte range specified by server for the owner of the stateid for a given byte range specified by
iar_offset and iar_count. The byte range specified by iar_offset and iar_offset and iar_count. The byte range specified by iar_offset and
iar_count need not currently exist in the file, but the iar_hints iar_count need not currently exist in the file, but the iar_hints
will apply to the byte range when it does exist. If iar_count is 0, will apply to the byte range when it does exist. If iar_count is 0,
all data following iar_offset is specified. The server MAY ignore all data following iar_offset is specified. The server MAY ignore
the advice. the advice.
The following are the possible hints: The following are the possible hints:
skipping to change at page 70, line 52 skipping to change at page 72, line 22
IO_ADVISE4_NOREUSE Specifies that the stateid holder expects to access IO_ADVISE4_NOREUSE Specifies that the stateid holder expects to access
the specified data once and then not reuse it thereafter. the specified data once and then not reuse it thereafter.
IO_ADVISE4_READ Specifies that the stateid holder expects to read the IO_ADVISE4_READ Specifies that the stateid holder expects to read the
specified data in the near future. specified data in the near future.
IO_ADVISE4_WRITE Specifies that the stateid holder expects to write IO_ADVISE4_WRITE Specifies that the stateid holder expects to write
the specified data in the near future. the specified data in the near future.
IO_ADVISE4_INIT_PROXIMITY The client has recently accessed the byte
range in its own cache. This informs the server that the data in
the byte range remains important to the client. When the server
reaches resource exhaustion, knowing which data is more important
allows the server to make better choices about which data to, for
example, purge from a cache or move to secondary storage. It also
informs the server which delegations are more important, since if
delegations are working correctly, once delegated to a client, a
server might never receive another I/O request for the file.
The server will return success if the operation is properly formed, The server will return success if the operation is properly formed,
otherwise the server will return an error. The server MUST NOT otherwise the server will return an error. The server MUST NOT
return an error if it does not recognize or does not support the return an error if it does not recognize or does not support the
requested advice. This is also true even if the client sends requested advice. This is also true even if the client sends
contradictory hints to the server, e.g., IO_ADVISE4_SEQUENTIAL and contradictory hints to the server, e.g., IO_ADVISE4_SEQUENTIAL and
IO_ADVISE4_RANDOM in a single IO_ADVISE operation. In this case, the IO_ADVISE4_RANDOM in a single IO_ADVISE operation. In this case, the
server MUST return success and a ior_hints value that indicates the server MUST return success and a ior_hints value that indicates the
hint it intends to optimize. For contradictory hints, this may mean hint it intends to optimize. For contradictory hints, this may mean
simply returning IO_ADVISE4_NORMAL for example. simply returning IO_ADVISE4_NORMAL for example.
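As a rough illustration of the rule above, the following Python sketch shows one way a server might pick the ior_hints value it returns. The numeric hint values, the CONFLICTS table, and the function resolve_hints are assumptions made for this example; none of them is taken from the draft.

```python
from enum import IntEnum


class Hint(IntEnum):
    # Illustrative values only; the real values come from the draft's XDR.
    NORMAL = 0
    SEQUENTIAL = 1
    RANDOM = 3
    NOREUSE = 6
    READ = 7
    WRITE = 8


# Hypothetical list of hint pairs this example server treats as contradictory.
CONFLICTS = [{Hint.SEQUENTIAL, Hint.RANDOM}]


def resolve_hints(requested, supported):
    """Return the hints the server intends to honor. Never raises an error."""
    requested = set(requested)
    for pair in CONFLICTS:
        if pair <= requested:
            # Contradictory advice: fall back to NORMAL instead of failing.
            return {Hint.NORMAL}
    # Unrecognized or unsupported hints are dropped silently.
    return {h for h in requested if h in supported}
```

The point of the sketch is that every path returns a hint set; there is no error path for unknown or conflicting advice.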
skipping to change at page 71, line 38 skipping to change at page 73, line 18
perhaps due to a temporary resource limitation. perhaps due to a temporary resource limitation.
Each issuance of the IO_ADVISE operation overrides all previous Each issuance of the IO_ADVISE operation overrides all previous
issuances of IO_ADVISE for a given byte range. This effectively issuances of IO_ADVISE for a given byte range. This effectively
follows a strategy of last hint wins for a given stateid and byte follows a strategy of last hint wins for a given stateid and byte
range. range.
Clients should assume that hints included in an IO_ADVISE operation Clients should assume that hints included in an IO_ADVISE operation
will be forgotten once the file is closed. will be forgotten once the file is closed.
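The "last hint wins" behavior and the loss of hints at close can be sketched as a small table keyed by stateid and byte range. This is a simplified model, not the draft's design: the class AdviceTable and its methods are hypothetical, and ranges are matched exactly rather than by overlap.

```python
class AdviceTable:
    """Hypothetical per-file store of the most recent hints."""

    def __init__(self):
        # (stateid, offset, count) -> frozenset of hint names
        self._hints = {}

    def advise(self, stateid, offset, count, hints):
        # A later call for the same key replaces the earlier hints outright.
        self._hints[(stateid, offset, count)] = frozenset(hints)

    def lookup(self, stateid, offset, count):
        return self._hints.get((stateid, offset, count), frozenset())

    def close(self, stateid):
        # Hints do not survive the close of the file.
        for key in [k for k in self._hints if k[0] == stateid]:
            del self._hints[key]
```

A real server would also need to decide how a new hint interacts with older hints on overlapping but unequal ranges; the sketch leaves that open.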
12.8.4. IMPLEMENTATION 13.8.4. IMPLEMENTATION
The NFS client may choose to issue and IO_ADVISE operation to the The NFS client may choose to issue an IO_ADVISE operation to the
server in several different instances. server in several different instances.
The most obvious is in direct response to an application's execution The most obvious is in direct response to an application's execution
of posix_fadvise. In this case, IO_ADVISE4_WRITE and IO_ADVISE4_READ of posix_fadvise. In this case, IO_ADVISE4_WRITE and IO_ADVISE4_READ
may be set based upon the type of file access specified when the file may be set based upon the type of file access specified when the file
was opened. was opened.
Another useful point would be when an application indicates it is Another useful point would be when an application indicates it is
using direct I/O. Direct I/O may be specified at file open, in which using direct I/O. Direct I/O may be specified at file open, in which
case an IO_ADVISE may be included in the same compound as the OPEN case an IO_ADVISE may be included in the same compound as the OPEN
operation with the IO_ADVISE4_NOREUSE flag set. Direct I/O may also operation with the IO_ADVISE4_NOREUSE flag set. Direct I/O may also
be specified separately, in which case an IO_ADVISE operation can be be specified separately, in which case an IO_ADVISE operation can be
sent to the server separately. As above, IO_ADVISE4_WRITE and sent to the server separately. As above, IO_ADVISE4_WRITE and
IO_ADVISE4_READ may be set based upon the type of file access IO_ADVISE4_READ may be set based upon the type of file access
specified when the file was opened. specified when the file was opened.
12.8.5. pNFS File Layout Data Type Considerations 13.8.5. pNFS File Layout Data Type Considerations
The IO_ADVISE considerations for pNFS are very similar to the COMMIT The IO_ADVISE considerations for pNFS are very similar to the COMMIT
considerations for pNFS. That is, as with COMMIT, some NFS server considerations for pNFS. That is, as with COMMIT, some NFS server
implementations prefer IO_ADVISE be done on the DS, and some prefer implementations prefer IO_ADVISE be done on the DS, and some prefer
it be done on the MDS. it be done on the MDS.
So for the file's layout type, it is proposed that NFSv4.2 include an So for the file's layout type, it is proposed that NFSv4.2 include an
additional hint NFL42_CARE_IO_ADVISE_THRU_MDS which is valid only on additional hint NFL42_CARE_IO_ADVISE_THRU_MDS which is valid only on
NFSv4.2 or higher. Any file's layout obtained with NFSv4.1 MUST NOT NFSv4.2 or higher. Any file's layout obtained with NFSv4.1 MUST NOT
have NFL42_UFLG_IO_ADVISE_THRU_MDS set. Any file's layout obtained have NFL42_UFLG_IO_ADVISE_THRU_MDS set. Any file's layout obtained
skipping to change at page 72, line 42 skipping to change at page 74, line 23
client's intended use of the file, then the client SHOULD send an client's intended use of the file, then the client SHOULD send an
IO_ADVISE to each DS. While the client MAY always send IO_ADVISE to IO_ADVISE to each DS. While the client MAY always send IO_ADVISE to
the MDS, if the server has not set NFL42_UFLG_IO_ADVISE_THRU_MDS, the the MDS, if the server has not set NFL42_UFLG_IO_ADVISE_THRU_MDS, the
client should expect that such an IO_ADVISE is futile. Note that a client should expect that such an IO_ADVISE is futile. Note that a
client SHOULD use the same set of arguments on each IO_ADVISE sent to client SHOULD use the same set of arguments on each IO_ADVISE sent to
a DS for the same open file reference. a DS for the same open file reference.
The server is not required to support different advice for different The server is not required to support different advice for different
DS's with the same open file reference. DS's with the same open file reference.
12.8.5.1. Dense and Sparse Packing Considerations 13.8.5.1. Dense and Sparse Packing Considerations
The IO_ADVISE operation MUST use the iar_offset and byte range as The IO_ADVISE operation MUST use the iar_offset and byte range as
dictated by the presence or absence of NFL4_UFLG_DENSE. dictated by the presence or absence of NFL4_UFLG_DENSE.
E.g., if NFL4_UFLG_DENSE is present, and a READ or WRITE to the DS E.g., if NFL4_UFLG_DENSE is present, and a READ or WRITE to the DS
for iar_offset 0 really means iar_offset 10000 in the logical file, for iar_offset 0 really means iar_offset 10000 in the logical file,
then an IO_ADVISE for iar_offset 0 means iar_offset 10000. then an IO_ADVISE for iar_offset 0 means iar_offset 10000.
E.g., if NFL4_UFLG_DENSE is absent, then a READ or WRITE to the DS E.g., if NFL4_UFLG_DENSE is absent, then a READ or WRITE to the DS
for iar_offset 0 really means iar_offset 0 in the logical file, then for iar_offset 0 really means iar_offset 0 in the logical file, then
skipping to change at page 74, line 14 skipping to change at page 75, line 41
If neither of the flags NFL42_UFLG_IO_ADVISE_THRU_MDS and If neither of the flags NFL42_UFLG_IO_ADVISE_THRU_MDS and
NFL4_UFLG_DENSE are set in the layout, then any IO_ADVISE request NFL4_UFLG_DENSE are set in the layout, then any IO_ADVISE request
sent to the data server with a byte range that overlaps a stripe unit sent to the data server with a byte range that overlaps a stripe unit
that the data server does not serve MUST NOT result in the status that the data server does not serve MUST NOT result in the status
NFS4ERR_PNFS_IO_HOLE. Instead, the response SHOULD be successful and NFS4ERR_PNFS_IO_HOLE. Instead, the response SHOULD be successful and
if the server applies IO_ADVISE hints on any stripe units that if the server applies IO_ADVISE hints on any stripe units that
overlap with the specified range, those hints SHOULD be indicated in overlap with the specified range, those hints SHOULD be indicated in
the response. the response.
12.8.6. Number of Supported File Segments 13.8.6. Number of Supported File Segments
In theory IO_ADVISE allows a client and server to support multiple In theory IO_ADVISE allows a client and server to support multiple
file segments, meaning that different, possibly overlapping, byte file segments, meaning that different, possibly overlapping, byte
ranges of the same open file reference will support different hints. ranges of the same open file reference will support different hints.
This is not practical, and in general the server will support just This is not practical, and in general the server will support just
one set of hints, and these will apply to the entire file. However, one set of hints, and these will apply to the entire file. However,
there are some hints that are very ephemeral and essentially amount there are some hints that are very ephemeral and essentially amount
to one time instructions to the NFS server, which will be forgotten to one time instructions to the NFS server, which will be forgotten
momentarily after IO_ADVISE is executed. momentarily after IO_ADVISE is executed.
skipping to change at page 75, line 9 skipping to change at page 76, line 33
o IO_ADVISE4_NOREUSE o IO_ADVISE4_NOREUSE
The following hints are modifiers to all other hints, and will apply The following hints are modifiers to all other hints, and will apply
to the entire file and/or to a one time instruction on the specified to the entire file and/or to a one time instruction on the specified
byte range: byte range:
o IO_ADVISE4_READ o IO_ADVISE4_READ
o IO_ADVISE4_WRITE o IO_ADVISE4_WRITE
12.8.7. Possible Additional Hint - IO_ADVISE4_RECENTLY_USED 13.9. Changes to Operation 51: LAYOUTRETURN
IO_ADVISE4_RECENTLY_USED The client has recently accessed the byte
range in its own cache. This informs the server that the data in
the byte range remains important to the client. When the server
reaches resource exhaustion, knowing which data is more important
allows the server to make better choices about which data to, for
example, purge from a cache or move to secondary storage. It also
informs the server which delegations are more important, since if
delegations are working correctly, once delegated to a client, a
server might never receive another I/O request for the file.
A use case for this hint is that of the NFS client or application
restart. In the event of restart, the app's/client's cache will be
cold and it will need to fill it from the server. If the server is
maintaining a list (LRU most likely) of byte ranges tagged with
IO_ADVISE4_RECENTLY_USED, then the server could have stored the data
in these ranges into a storage medium that is less expensive than
DRAM, and faster than random access magnetic or optical media, such
as flash. This allows the end to end application to storage system
to co-operate to meet a service level agreement/objective contracted
to the end user by the IT provider.
On the other side, this is effectively a hint regarding multi-level
caching, and it may be more useful to specify a more formal multi-
level caching system. In addition, the action to be taken by the
server file system with this hint, and hence its usefulness, is
unclear. For example, as most clients already cache data that they
know is important, having this data cached twice may be unnecessary.
In fact, substantial performance improvements have been demonstrated
by making caches more exclusive between each other [25], not the
other way around. This means that there is a strong argument to be
made that servers should immediately purge the described cached data
upon receiving this hint. Other work showed that even infinite sized
secondary caches can be largely ineffective [26], but this of course
is subject to the workload.
12.9. Changes to Operation 51: LAYOUTRETURN
12.9.1. Introduction 13.9.1. Introduction
In the pNFS description provided in [2], the client is not enabled to In the pNFS description provided in [2], the client is not enabled to
relay an error code from the DS to the MDS. In the specification of relay an error code from the DS to the MDS. In the specification of
the Objects-Based Layout protocol [8], use is made of the opaque the Objects-Based Layout protocol [9], use is made of the opaque
lrf_body field of the LAYOUTRETURN argument to do such a relaying of lrf_body field of the LAYOUTRETURN argument to do such a relaying of
error codes. In this section, we define a new data structure to error codes. In this section, we define a new data structure to
enable the passing of error codes back to the MDS and provide some enable the passing of error codes back to the MDS and provide some
guidelines on what both the client and MDS should expect in such guidelines on what both the client and MDS should expect in such
circumstances. circumstances.
There are two broad classes of errors, transient and persistent. The There are two broad classes of errors, transient and persistent. The
client SHOULD strive to only use this new mechanism to report client SHOULD strive to only use this new mechanism to report
persistent errors. It MUST be able to deal with transient issues by persistent errors. It MUST be able to deal with transient issues by
itself. Also, while the client might consider an issue to be itself. Also, while the client might consider an issue to be
skipping to change at page 76, line 29 skipping to change at page 77, line 17
hard error. The MDS on the other hand, is waiting for the client to hard error. The MDS on the other hand, is waiting for the client to
report such an error. For it, the mission is accomplished in that report such an error. For it, the mission is accomplished in that
the client has returned a layout that the MDS had most likely the client has returned a layout that the MDS had most likely
recalled. recalled.
The existing LAYOUTRETURN operation is extended by introducing a new The existing LAYOUTRETURN operation is extended by introducing a new
data structure to report errors, layoutreturn_device_error4. Also, data structure to report errors, layoutreturn_device_error4. Also,
layoutreturn_error_report4 is introduced to enable an array of errors layoutreturn_error_report4 is introduced to enable an array of errors
to be reported. to be reported.
12.9.2. ARGUMENT 13.9.2. ARGUMENT
The ARGUMENT specification of the LAYOUTRETURN operation in section The ARGUMENT specification of the LAYOUTRETURN operation in section
18.44.1 of [2] is augmented by the following XDR code [24]: 18.44.1 of [2] is augmented by the following XDR code [23]:
struct layoutreturn_device_error4 { struct layoutreturn_device_error4 {
deviceid4 lrde_deviceid; deviceid4 lrde_deviceid;
nfsstat4 lrde_status; nfsstat4 lrde_status;
nfs_opnum4 lrde_opnum; nfs_opnum4 lrde_opnum;
}; };
struct layoutreturn_error_report4 { struct layoutreturn_error_report4 {
layoutreturn_device_error4 lrer_errors<>; layoutreturn_device_error4 lrer_errors<>;
}; };
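To make the wire layout of these two structures concrete, here is a minimal Python sketch that serializes a layoutreturn_error_report4 following the usual XDR rules. It assumes that deviceid4 is a fixed 16-byte opaque, as in the NFSv4.1 XDR, and that nfsstat4 and nfs_opnum4 are 32-bit enums. The function names are hypothetical. The sketch covers only the report itself, not its placement inside the opaque lrf_body field.

```python
import struct

DEVICEID4_SIZE = 16  # assumed to match NFS4_DEVICEID4_SIZE in the NFSv4.1 XDR


def encode_device_error(deviceid: bytes, status: int, opnum: int) -> bytes:
    """Encode one layoutreturn_device_error4."""
    if len(deviceid) != DEVICEID4_SIZE:
        raise ValueError("deviceid4 must be exactly 16 bytes")
    # Fixed-length opaque (no length prefix), then two big-endian 32-bit enums.
    return deviceid + struct.pack(">ii", status, opnum)


def encode_error_report(errors) -> bytes:
    """Encode layoutreturn_error_report4 from (deviceid, status, opnum) tuples."""
    # Variable-length array: 32-bit element count followed by the elements.
    body = b"".join(encode_device_error(*e) for e in errors)
    return struct.pack(">I", len(errors)) + body
```

For example, a single entry with an all-zero device ID, status 6, and opnum 25 encodes to 28 bytes: a 4-byte count, the 16-byte device ID, and two 4-byte integers. The values 6 and 25 are assumed here to correspond to NFS4ERR_NXIO and OP_READ in the NFSv4.1 XDR.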
12.9.3. RESULT 13.9.3. RESULT
The RESULT of the LAYOUTRETURN operation is unchanged; see section The RESULT of the LAYOUTRETURN operation is unchanged; see section
18.44.2 of [2]. 18.44.2 of [2].
12.9.4. DESCRIPTION 13.9.4. DESCRIPTION
The following text is added to the end of the LAYOUTRETURN operation The following text is added to the end of the LAYOUTRETURN operation
DESCRIPTION in section 18.44.3 of [2]. DESCRIPTION in section 18.44.3 of [2].
When a client used LAYOUTRETURN with a type of LAYOUTRETURN4_FILE, When a client used LAYOUTRETURN with a type of LAYOUTRETURN4_FILE,
then if the lrf_body field is NULL, it indicates to the MDS that the then if the lrf_body field is NULL, it indicates to the MDS that the
client experienced no errors. If lrf_body is non-NULL, then the client experienced no errors. If lrf_body is non-NULL, then the
field references error information which is layout type specific. field references error information which is layout type specific.
I.e., the Objects-Based Layout protocol can continue to utilize I.e., the Objects-Based Layout protocol can continue to utilize
lrf_body as specified in [8]. For both Files-Based Layouts, the lrf_body as specified in [9]. For both Files-Based Layouts, the
field references a layoutreturn_error_report4, which contains an field references a layoutreturn_error_report4, which contains an
array of layoutreturn_device_error4. array of layoutreturn_device_error4.
Each individual layoutreturn_device_error4 describes a single error Each individual layoutreturn_device_error4 describes a single error
associated with a DS, which is identified via lrde_deviceid. The associated with a DS, which is identified via lrde_deviceid. The
operation which returned the error is identified via lrde_opnum. operation which returned the error is identified via lrde_opnum.
Finally the NFS error value (nfsstat4) encountered is provided via Finally the NFS error value (nfsstat4) encountered is provided via
lrde_status and may consist of the following error codes: lrde_status and may consist of the following error codes:
NFS4_OK: No issues were found for this device. NFS4_OK: No issues were found for this device.
NFS4ERR_NXIO: The client was unable to establish any communication NFS4ERR_NXIO: The client was unable to establish any communication
with the DS. with the DS.
NFS4ERR_*: The client was able to establish communication with the NFS4ERR_*: The client was able to establish communication with the
DS and is returning one of the allowed error codes for the DS and is returning one of the allowed error codes for the
operation denoted by lrde_opnum. operation denoted by lrde_opnum.
12.9.5. IMPLEMENTATION 13.9.5. IMPLEMENTATION
The following text is added to the end of the LAYOUTRETURN operation The following text is added to the end of the LAYOUTRETURN operation
IMPLEMENTATION in section 18.44.4 of [2]. IMPLEMENTATION in section 18.44.4 of [2].
A client that expects to use pNFS for a mounted filesystem SHOULD A client that expects to use pNFS for a mounted filesystem SHOULD
check for pNFS support at mount time. This check SHOULD be performed check for pNFS support at mount time. This check SHOULD be performed
by sending a GETDEVICELIST operation, followed by layout-type- by sending a GETDEVICELIST operation, followed by layout-type-
specific checks for accessibility of each storage device returned by specific checks for accessibility of each storage device returned by
GETDEVICELIST. If the NFS server does not support pNFS, the GETDEVICELIST. If the NFS server does not support pNFS, the
GETDEVICELIST operation will be rejected with an NFS4ERR_NOTSUPP GETDEVICELIST operation will be rejected with an NFS4ERR_NOTSUPP
skipping to change at page 78, line 16 skipping to change at page 78, line 49
When an I/O fails to a storage device, the client SHOULD retry the When an I/O fails to a storage device, the client SHOULD retry the
failed I/O via the MDS. In this situation, before retrying the I/O, failed I/O via the MDS. In this situation, before retrying the I/O,
the client SHOULD return the layout, or the affected portion thereof, the client SHOULD return the layout, or the affected portion thereof,
and SHOULD indicate which storage device or devices were problematic. and SHOULD indicate which storage device or devices were problematic.
If the client does not do this, the MDS may issue a layout recall If the client does not do this, the MDS may issue a layout recall
callback in order to perform the retried I/O. callback in order to perform the retried I/O.
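The ordering described in the paragraph above (return the layout with an error report first, then retry through the MDS) might look like the following on the client side. This is only a sketch: handle_ds_io_failure and the two callables layoutreturn and io_via_mds are hypothetical stand-ins for whatever the client implementation provides.

```python
def handle_ds_io_failure(io, deviceid, status, opnum, layoutreturn, io_via_mds):
    """Report a failed data-server I/O, then retry the same I/O via the MDS.

    io is a dict with at least "offset" and "length" keys (an assumption of
    this sketch). layoutreturn and io_via_mds are caller-supplied callables.
    """
    # One entry per problematic device; here only a single device failed.
    report = [(deviceid, status, opnum)]
    # Return the affected portion of the layout before retrying.
    layoutreturn(io["offset"], io["length"], report)
    # Only then send the I/O through the metadata server.
    return io_via_mds(io)
```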
The client needs to be cognizant that since this error handling is The client needs to be cognizant that since this error handling is
optional in the MDS, the MDS may silently ignore this functionality. optional in the MDS, the MDS may silently ignore this functionality.
Also, as the MDS may consider some issues the client reports to be Also, as the MDS may consider some issues the client reports to be
expected (see Section 12.9.1), the client might find it difficult to expected (see Section 13.9.1), the client might find it difficult to
detect an MDS which has not implemented error handling via detect an MDS which has not implemented error handling via
LAYOUTRETURN. LAYOUTRETURN.
If an MDS is aware that a storage device is proving problematic to a If an MDS is aware that a storage device is proving problematic to a
client, the MDS SHOULD NOT include that storage device in any pNFS client, the MDS SHOULD NOT include that storage device in any pNFS
layouts sent to that client. If the MDS is aware that a storage layouts sent to that client. If the MDS is aware that a storage
device is affecting many clients, then the MDS SHOULD NOT include device is affecting many clients, then the MDS SHOULD NOT include
that storage device in any pNFS layouts sent out. Clients must still that storage device in any pNFS layouts sent out. Clients must still
be aware that the MDS might not have any choice in using the storage be aware that the MDS might not have any choice in using the storage
device, i.e., there might only be one possible layout for the system. device, i.e., there might only be one possible layout for the system.
skipping to change at page 78, line 45 skipping to change at page 79, line 30
using the problematic storage devices in layouts for that client, but using the problematic storage devices in layouts for that client, but
the MDS is not required to indefinitely retain per-client storage the MDS is not required to indefinitely retain per-client storage
device error information. An MDS is also not required to device error information. An MDS is also not required to
automatically reinstate use of a previously problematic storage automatically reinstate use of a previously problematic storage
device; administrative intervention may be required instead. device; administrative intervention may be required instead.
A client MAY perform I/O via the MDS even when the client holds a A client MAY perform I/O via the MDS even when the client holds a
layout that covers the I/O; servers MUST support this client layout that covers the I/O; servers MUST support this client
behavior, and MAY recall layouts as needed to complete I/Os. behavior, and MAY recall layouts as needed to complete I/Os.
12.10. Operation 65: READ_PLUS 13.10. Operation 65: READ_PLUS
READ_PLUS is a new read operation which allows NFS clients to avoid READ_PLUS is a new read operation which allows NFS clients to avoid
reading holes in a sparse file and to efficiently transfer ADBs. reading holes in a sparse file and to efficiently transfer ADBs.
READ_PLUS is guaranteed to perform no worse than READ, and can
dramatically improve performance with sparse files.
READ_PLUS supports all the features of the existing NFSv4.1 READ READ_PLUS supports all the features of the existing NFSv4.1 READ
operation [2] and adds a simple yet significant extension to the operation [2] but also extends the response to avoid returning data
format of its response. The change allows the client to avoid for portions of the file which are either initialized and contain no
returning data for portions of the file which are either initialized backing store or if the result would appear to be so. I.e., if the
and contain no backing store or if the result would appear to be so. result was a data block composed entirely of zeros, then it is easier
I.e., if the result was a data block composed entirely of zeros, then to return a hole. Returning data blocks of uninitialized data wastes
it is easier to return a hole. Returning data blocks of uninitialized computational and network resources, thus reducing performance.
data wastes computational and network resources, thus reducing READ_PLUS uses a new result structure that tells the client that the
performance. READ_PLUS uses a new result structure that tells the result is all zeroes AND the byte-range of the hole in which the
client that the result is all zeroes AND the byte-range of the hole request was made.
in which the request was made.
If the client sends a READ operation, it is explicitly stating that If the client sends a READ operation, it is explicitly stating that
it is neither supporting sparse files or ADBs. So if a READ occurs it is neither supporting sparse files nor ADBs. So if a READ occurs
on a sparse ADB or file, then the server must expand such data to be on a sparse ADB or file, then the server must expand such data to be
raw bytes. If a READ occurs in the middle of a hole or ADB, the raw bytes. If a READ occurs in the middle of a hole or ADB, the
server can only send back bytes starting from that offset. server can only send back bytes starting from that offset.
Such an operation is inefficient for transfer of sparse sections of Such an operation is inefficient for transfer of sparse sections of
the file. As such, READ is marked as OBSOLETE in NFSv4.2. Instead, the file. As such, READ is marked as OBSOLETE in NFSv4.2. Instead,
a client should issue READ_PLUS. Note that as the client has no a a client should issue READ_PLUS. Note that as the client has no a
priori knowledge of whether an ADB is present or not, it should priori knowledge of whether either an ADB or a hole is present or
always use READ_PLUS. not, it should always use READ_PLUS.
12.10.1. ARGUMENT 13.10.1. ARGUMENT
struct READ_PLUS4args { struct READ_PLUS4args {
/* CURRENT_FH: file */ /* CURRENT_FH: file */
stateid4 rpa_stateid; stateid4 rpa_stateid;
offset4 rpa_offset; offset4 rpa_offset;
count4 rpa_count; count4 rpa_count;
}; };
12.10.2. RESULT 13.10.2. RESULT
union read_plus_content switch (data_content4 content) { union read_plus_content switch (data_content4 content) {
case NFS4_CONTENT_DATA: case NFS4_CONTENT_DATA:
opaque rpc_data<>; opaque rpc_data<>;
case NFS4_CONTENT_APP_BLOCK: case NFS4_CONTENT_APP_BLOCK:
app_data_block4 rpc_block; app_data_block4 rpc_block;
case NFS4_CONTENT_HOLE: case NFS4_CONTENT_HOLE:
data_info4 rpc_hole; data_info4 rpc_hole;
default: default:
void; void;
skipping to change at page 80, line 33 skipping to change at page 80, line 45
read_plus_content rpr_contents<>; read_plus_content rpr_contents<>;
}; };
union READ_PLUS4res switch (nfsstat4 status) { union READ_PLUS4res switch (nfsstat4 status) {
case NFS4_OK: case NFS4_OK:
read_plus_res4 resok4; read_plus_res4 resok4;
default: default:
void; void;
}; };
12.10.3. DESCRIPTION 13.10.3. DESCRIPTION
The READ_PLUS operation is based upon the NFSv4.1 READ operation [2], The READ_PLUS operation is based upon the NFSv4.1 READ operation [2]
and similarly reads data from the regular file identified by the and similarly reads data from the regular file identified by the
current filehandle. current filehandle.
The client provides a rpa_offset of where the READ_PLUS is to start The client provides a rpa_offset of where the READ_PLUS is to start
and a rpa_count of how many bytes are to be read. A rpa_offset of and a rpa_count of how many bytes are to be read. A rpa_offset of
zero means to read data starting at the beginning of the file. If zero means to read data starting at the beginning of the file. If
rpa_offset is greater than or equal to the size of the file, the rpa_offset is greater than or equal to the size of the file, the
status NFS4_OK is returned with di_length (the data length) set to status NFS4_OK is returned with di_length (the data length) set to
zero and eof set to TRUE. READ_PLUS is subject to access permissions zero and eof set to TRUE. READ_PLUS is subject to access permissions
checking. checking.
The READ_PLUS result is comprised of an array of rpr_contents, each The READ_PLUS result is comprised of an array of rpr_contents, each
of which describes a data_content4 type of data. For NFSv4.2, the of which describes a data_content4 type of data. For NFSv4.2, the
allowed values are data, ADB, and hole. A server is required to allowed values are data, ADB, and hole. A server is required to
support the data type, but not ADB nor hole. Both an ADB and a hole support the data type, but neither ADB nor hole. Both an ADB and a
must be returned in its entirety - clients must be prepared to get hole must be returned in its entirety - clients must be prepared to
more information than they requested. get more information than they requested.
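Because a hole is returned in its entirety, a client may receive a segment that starts before, or ends after, the range it asked for. The Python sketch below shows one possible way to clip such segments to the requested range. The tuple model ("data", offset, bytes) and ("hole", offset, length) is an assumption of this example, not the wire format, and ADB segments are not handled.

```python
def assemble(req_offset, req_count, segments):
    """Build the requested bytes from data and hole segments.

    segments: iterable of ("data", offset, bytes) or ("hole", offset, length).
    """
    # The buffer starts zero-filled, so holes need no explicit copy.
    buf = bytearray(req_count)
    req_end = req_offset + req_count
    for kind, off, payload in segments:
        length = payload if kind == "hole" else len(payload)
        # Clip the segment to the requested range.
        lo = max(off, req_offset)
        hi = min(off + length, req_end)
        if lo >= hi:
            continue  # segment lies wholly outside the request
        if kind == "data":
            buf[lo - req_offset:hi - req_offset] = payload[lo - off:hi - off]
    return bytes(buf)
```

For a request at offset 100 for 8 bytes, a hole covering bytes 0 through 103 followed by six data bytes at offset 104 yields four zero bytes and then the first four data bytes.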
READ_PLUS has to support all of the errors which are returned by READ
plus NFS4ERR_UNION_NOTSUPP. If the client asks for a hole and the
server does not support that arm of the discriminated union, but does
support one or more additional arms, it can signal to the client that
it supports the operation, but not the arm with
NFS4ERR_UNION_NOTSUPP.
If the data to be returned is comprised entirely of zeros, then the If the data to be returned is comprised entirely of zeros, then the
server may elect to return that data as a hole. The server server may elect to return that data as a hole. The server
differentiates this to the client by setting di_allocated to TRUE in differentiates this to the client by setting di_allocated to TRUE in
this case. Note that in such a scenario, the server is not required this case. Note that in such a scenario, the server is not required
to determine the full extent of the "hole" - it does not need to to determine the full extent of the "hole" - it does not need to
determine where the zeros start and end. determine where the zeros start and end.
The server may elect to return adjacent elements of the same type. The server may elect to return adjacent elements of the same type.
For example, the guard pattern or block size of an ADB might change, For example, the guard pattern or block size of an ADB might change,
skipping to change at page 82, line 24 skipping to change at page 82, line 43
For a READ_PLUS with a stateid value of all bits equal to zero, the For a READ_PLUS with a stateid value of all bits equal to zero, the
server MAY allow the READ_PLUS to be serviced subject to mandatory server MAY allow the READ_PLUS to be serviced subject to mandatory
byte-range locks or the current share deny modes for the file. For a byte-range locks or the current share deny modes for the file. For a
READ_PLUS with a stateid value of all bits equal to one, the server READ_PLUS with a stateid value of all bits equal to one, the server
MAY allow READ_PLUS operations to bypass locking checks at the MAY allow READ_PLUS operations to bypass locking checks at the
server. server.
On success, the current filehandle retains its value. On success, the current filehandle retains its value.
12.10.4. IMPLEMENTATION 13.10.4. IMPLEMENTATION
If the server returns a short read, then the client should send If the server returns a short read, then the client should send
another READ_PLUS to get the remaining data. A server may return another READ_PLUS to get the remaining data. A server may return
less data than requested under several circumstances. The file may less data than requested under several circumstances. The file may
have been truncated by another client or perhaps on the server have been truncated by another client or perhaps on the server
itself, changing the file size from what the requesting client itself, changing the file size from what the requesting client
believes to be the case. This would reduce the actual amount of data believes to be the case. This would reduce the actual amount of data
available to the client. It is possible that the server reduced the available to the client. It is possible that the server reduced the
transfer size and so returned a short read result. Server resource transfer size and so returned a short read result. Server resource
exhaustion may also result in a short read. exhaustion may also result in a short read.
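The short-read handling described above amounts to a simple loop on the client. The sketch below assumes a hypothetical callable read_plus(offset, count) that returns a (data, eof) pair; the real operation returns a richer result, so this illustrates only the retry logic.

```python
def read_fully(read_plus, offset, count):
    """Reissue read_plus until count bytes are read or end of file is hit."""
    chunks = []
    while count > 0:
        data, eof = read_plus(offset, count)
        chunks.append(data)
        offset += len(data)
        count -= len(data)
        # Stop at end of file. Also stop on an empty reply so that a
        # misbehaving server cannot make the loop spin forever.
        if eof or not data:
            break
    return b"".join(chunks)
```

The caller must still cope with receiving fewer bytes than requested, since the file may have been truncated between calls.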
skipping to change at page 83, line 9 skipping to change at page 83, line 28
being read, the delegation must be recalled, and the operation cannot being read, the delegation must be recalled, and the operation cannot
proceed until that delegation is returned or revoked. Except where proceed until that delegation is returned or revoked. Except where
this happens very quickly, one or more NFS4ERR_DELAY errors will be this happens very quickly, one or more NFS4ERR_DELAY errors will be
returned to requests made while the delegation remains outstanding. returned to requests made while the delegation remains outstanding.
Normally, delegations will not be recalled as a result of a READ_PLUS Normally, delegations will not be recalled as a result of a READ_PLUS
operation since the recall will occur as a result of an earlier OPEN. operation since the recall will occur as a result of an earlier OPEN.
However, since it is possible for a READ_PLUS to be done with a However, since it is possible for a READ_PLUS to be done with a
special stateid, the server needs to check for this case even though special stateid, the server needs to check for this case even though
the client should have done an OPEN previously. the client should have done an OPEN previously.
12.10.4.1. Additional pNFS Implementation Information 13.10.4.1. Additional pNFS Implementation Information
draft-ietf-nfsv4-minorversion2-07.txt:

[[Comment.11: We need to go over this section. --TH]] With pNFS, the
semantics of using READ_PLUS remain the same. Any data server MAY
return a READ_HOLE result for a READ_PLUS request that it receives.
When a data server chooses to return a READ_HOLE result, it has the
option of returning hole information for the data stored on that data
server (as defined by the data layout), but it MUST NOT return a
nfs_readplusreshole structure with a byte range that includes data
managed by another data server.

1. Data servers that cannot determine hole information SHOULD return
HOLE_NOINFO.

2. Data servers that can obtain hole information for the parts of
the file stored on that data server SHOULD return HOLE_INFO and the
byte range of the hole stored on that data server.

draft-ietf-nfsv4-minorversion2-08.txt:

With pNFS, the semantics of using READ_PLUS remain the same. Any
data server MAY return a hole or ADB result for a READ_PLUS request
that it receives.

When a data server chooses to return a hole result, it has the option
of returning hole information for the data stored on that data server
(as defined by the data layout), but it MUST NOT return results for a
byte range that includes data managed by another data server. Data
servers that can obtain hole information for the parts of the file
stored on that data server SHOULD return HOLE_INFO and the byte range
of the hole stored on that data server.
A data server should do its best to return as much information about A data server should do its best to return as much information about
a hole as is feasible without having to contact the metadata server. a hole as is feasible without having to contact the metadata server.
If communication with the metadata server is required, then every If communication with the metadata server is required, then every
attempt should be made to minimize the number of requests. attempt should be made to minimize the number of requests.
If mandatory locking is enforced, then the data server must also If mandatory locking is enforced, then the data server must also
ensure that it returns only information for a hole that is within the ensure that it returns only information for a hole that is within the
owner's locked byte range. owner's locked byte range.
12.10.5. READ_PLUS with Sparse Files Example 13.10.5. READ_PLUS with Sparse Files Example
The following table describes a sparse file. For each byte range, The following table describes a sparse file. For each byte range,
the file contains either non-zero data or a hole. In addition, the the file contains either non-zero data or a hole. In addition, the
server in this example uses a Hole Threshold of 32K. server in this example uses a Hole Threshold of 32K.
+-------------+----------+ +-------------+----------+
| Byte-Range | Contents | | Byte-Range | Contents |
+-------------+----------+ +-------------+----------+
| 0-15999 | Hole | | 0-15999 | Hole |
| 16K-31999 | Non-Zero | | 16K-31999 | Non-Zero |
| 32K-255999 | Hole | | 32K-255999 | Hole |
| 256K-287999 | Non-Zero | | 256K-287999 | Non-Zero |
| 288K-353999 | Hole | | 288K-353999 | Hole |
| 354K-417999 | Non-Zero | | 354K-417999 | Non-Zero |
+-------------+----------+ +-------------+----------+
Table 3 Table 4
draft-ietf-nfsv4-minorversion2-07.txt:

Under the given circumstances, if a client were to read the file from
beginning to end with a max read size of 64K, the following will be
the result. This assumes the client has already opened the file,
acquired a valid stateid ('s' in the example), and just needs to
issue READ_PLUS requests. [[Comment.12: Change the results to match
array results. --TH]]

1. READ_PLUS(s, 0, 64K) --> NFS_OK, eof = false, data<>[32K].
Return a short read, as the last half of the request was all
zeros. Note that the first hole is read back as all zeros as it
is below the Hole Threshold.

2. READ_PLUS(s, 32K, 64K) --> NFS_OK, readplusrestype4 = READ_HOLE,
nfs_readplusreshole(HOLE_INFO)(32K, 224K). The requested range
was all zeros, and the current hole begins at offset 32K and is
224K in length.

3. READ_PLUS(s, 256K, 64K) --> NFS_OK, readplusrestype4 = READ_OK,
eof = false, data<>[32K]. Return a short read, as the last half
of the request was all zeros.

4. READ_PLUS(s, 288K, 64K) --> NFS_OK, readplusrestype4 = READ_HOLE,
nfs_readplusreshole(HOLE_INFO)(288K, 66K).

5. READ_PLUS(s, 354K, 64K) --> NFS_OK, readplusrestype4 = READ_OK,
eof = true, data<>[64K].

draft-ietf-nfsv4-minorversion2-08.txt:

Under the given circumstances, if a client were to read from the file
with a max read size of 64K, the following will be the results for
the given READ_PLUS calls. This assumes the client has already
opened the file, acquired a valid stateid ('s' in the example), and
just needs to issue READ_PLUS requests.

1. READ_PLUS(s, 0, 64K) --> NFS_OK, eof = false, <data[0,32K],
hole[32K,224K]>. Since the first hole is less than the server's
Hole Threshold, the first 32K of the file is returned as data
and the remaining 32K is returned as a hole which actually
extends to 256K.

2. READ_PLUS(s, 32K, 64K) --> NFS_OK, eof = false, <hole[32K,224K]>.
The requested range was all zeros, and the current hole begins at
offset 32K and is 224K in length. Note that the client should
not have followed up the previous READ_PLUS request with this one
as the hole information from the previous call extended past what
the client was requesting.

3. READ_PLUS(s, 256K, 64K) --> NFS_OK, eof = false, <data[256K,
288K], hole[288K, 354K]>. Returns an array of the 32K data and
the hole which extends to 354K.

4. READ_PLUS(s, 354K, 64K) --> NFS_OK, eof = true, <data[354K,
418K]>. Returns the final 64K of data and informs the client
there is no more data in the file.
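The array-style results of the -08 example can be reproduced with a
small simulation. This is only a sketch under stated assumptions, not
the protocol's XDR encoding: the function read_plus, the tuple layout
(kind, offset, length), and the constant names are hypothetical; "K" is
taken to mean 1000 bytes, as the table's ranges (0-15999, 16K-31999)
suggest; and a hole that starts before the requested offset is clipped
to the offset, which the example itself does not exercise.

```python
K = 1000  # assumption: "K" in the table means 1000 bytes

# (start, end_exclusive, kind) extents taken from the sparse file table
EXTENTS = [
    (0, 16 * K, "hole"),
    (16 * K, 32 * K, "data"),
    (32 * K, 256 * K, "hole"),
    (256 * K, 288 * K, "data"),
    (288 * K, 354 * K, "hole"),
    (354 * K, 418 * K, "data"),
]
HOLE_THRESHOLD = 32 * K          # holes smaller than this are sent as data
FILE_SIZE = EXTENTS[-1][1]


def read_plus(offset, count):
    """Return (eof, segments); each segment is (kind, offset, length)."""
    end = min(offset + count, FILE_SIZE)
    segments = []
    for start, stop, kind in EXTENTS:
        if stop <= offset or start >= end:
            continue  # extent does not overlap the requested range
        seg_start = max(start, offset)
        if kind == "hole" and stop - start >= HOLE_THRESHOLD:
            # Report the hole to its real end, even past the request.
            seg = ("hole", seg_start, stop - seg_start)
        else:
            # Real data, or a small hole read back as zeros.
            seg = ("data", seg_start, min(stop, end) - seg_start)
        if segments and segments[-1][0] == "data" and seg[0] == "data":
            prev = segments[-1]
            segments[-1] = ("data", prev[1], prev[2] + seg[2])  # merge
        else:
            segments.append(seg)
    last = segments[-1] if segments else None
    eof = last is None or last[1] + last[2] >= FILE_SIZE
    return eof, segments


for off in (0, 32 * K, 256 * K, 354 * K):
    print(off, read_plus(off, 64 * K))
```

Each segment here is expressed as (offset, length), so the third call
yields data of length 32K at 256K followed by a hole of length 66K at
288K, which is the same range the example writes as hole[288K, 354K].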
12.11. Operation 66: SEEK 13.11. Operation 66: SEEK
SEEK is an operation that allows a client to determine the location SEEK is an operation that allows a client to determine the location
of the next data_content4 in a file. of the next data_content4 in a file.
12.11.1. ARGUMENT 13.11.1. ARGUMENT
struct SEEK4args { struct SEEK4args {
/* CURRENT_FH: file */ /* CURRENT_FH: file */
stateid4 sa_stateid; stateid4 sa_stateid;
offset4 sa_offset; offset4 sa_offset;
data_content4 sa_what; data_content4 sa_what;
}; };
12.11.2. RESULT 13.11.2. RESULT
union seek_content switch (data_content4 content) { union seek_content switch (data_content4 content) {
case NFS4_CONTENT_DATA: case NFS4_CONTENT_DATA:
data_info4 sc_data; data_info4 sc_data;
case NFS4_CONTENT_APP_BLOCK: case NFS4_CONTENT_APP_BLOCK:
app_data_block4 sc_block; app_data_block4 sc_block;
case NFS4_CONTENT_HOLE: case NFS4_CONTENT_HOLE:
data_info4 sc_hole; data_info4 sc_hole;
default: default:
void; void;
skipping to change at page 85, line 39 skipping to change at page 85, line 44
seek_content sr_contents; seek_content sr_contents;
}; };
union SEEK4res switch (nfsstat4 status) { union SEEK4res switch (nfsstat4 status) {
case NFS4_OK: case NFS4_OK:
seek_res4 resok4; seek_res4 resok4;
default: default:
void; void;
}; };
12.11.3. DESCRIPTION 13.11.3. DESCRIPTION
From the given sa_offset, find the next data_content4 of type sa_what From the given sa_offset, find the next data_content4 of type sa_what
in the file. For either a hole or ADB, this must return the in the file. For either a hole or ADB, this must return the
data_content4 in its entirety. For data, it must not return the data_content4 in its entirety. For data, it must not return the
actual data. actual data.
SEEK must follow the same rules for stateids as READ_PLUS SEEK must follow the same rules for stateids as READ_PLUS
(Section 12.10.3). (Section 13.10.3).
If the server could not find a corresponding sa_what, then the status If the server could not find a corresponding sa_what, then the status
would still be NFS4_OK, but sr_eof would be TRUE. The sr_contents would still be NFS4_OK, but sr_eof would be TRUE. The sr_contents
would contain zeroed-out content of the appropriate type. would contain zeroed-out content of the appropriate type.
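The SEEK behaviour described above can be sketched against a list of
extents. This is an illustration only: the function seek, the extent
tuples, and the (offset, length) return shape are hypothetical, and the
sketch assumes that an extent containing sa_offset counts as the "next"
one, which the text does not state explicitly.

```python
# (start, end_exclusive, kind) extents of a hypothetical sparse file
EXTENTS = [
    (0, 16000, "hole"),
    (16000, 32000, "data"),
    (32000, 256000, "hole"),
    (256000, 288000, "data"),
]


def seek(extents, sa_offset, sa_what):
    """Return (sr_eof, (offset, length)) for the next extent of sa_what."""
    for start, stop, kind in extents:
        if kind == sa_what and stop > sa_offset:
            # The extent is reported in its entirety; no file data is sent.
            return False, (start, stop - start)
    # Nothing found: the status is still OK, sr_eof is TRUE, and the
    # content is zeroed out.
    return True, (0, 0)


print(seek(EXTENTS, 0, "data"))
print(seek(EXTENTS, 20000, "hole"))
print(seek(EXTENTS, 260000, "hole"))
```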
13. NFSv4.2 Callback Operations 14. NFSv4.2 Callback Operations
13.1. Procedure 16: CB_ATTR_CHANGED - Notify Client that the File's 14.1. Procedure 16: CB_ATTR_CHANGED - Notify Client that the File's
Attributes Changed Attributes Changed
13.1.1. ARGUMENTS 14.1.1. ARGUMENTS
struct CB_ATTR_CHANGED4args { struct CB_ATTR_CHANGED4args {
nfs_fh4 acca_fh; nfs_fh4 acca_fh;
bitmap4 acca_critical; bitmap4 acca_critical;
bitmap4 acca_info; bitmap4 acca_info;
}; };
13.1.2. RESULTS 14.1.2. RESULTS
struct CB_ATTR_CHANGED4res { struct CB_ATTR_CHANGED4res {
nfsstat4 accr_status; nfsstat4 accr_status;
}; };
13.1.3. DESCRIPTION 14.1.3. DESCRIPTION
The CB_ATTR_CHANGED callback operation is used by the server to The CB_ATTR_CHANGED callback operation is used by the server to
indicate to the client that the file's attributes have been modified indicate to the client that the file's attributes have been modified
on the server. The server does not convey how the attributes have on the server. The server does not convey how the attributes have
changed, just that they have been modified. The server can inform changed, just that they have been modified. The server can inform
the client about both critical and informational attribute changes in the client about both critical and informational attribute changes in
the bitmask arguments. The client SHOULD query the server about all the bitmask arguments. The client SHOULD query the server about all
attributes set in acca_critical. For all changes reflected in attributes set in acca_critical. For all changes reflected in
acca_info, the client can decide whether or not it wants to poll the acca_info, the client can decide whether or not it wants to poll the
server. server.
The CB_ATTR_CHANGED callback operation with the FATTR4_SEC_LABEL set The CB_ATTR_CHANGED callback operation with the FATTR4_SEC_LABEL set
in acca_critical is the method used by the server to indicate that in acca_critical is the method used by the server to indicate that
the MAC label for the file referenced by acca_fh has changed. In the MAC label for the file referenced by acca_fh has changed. In
many ways, the server does not care about the result returned by the many ways, the server does not care about the result returned by the
client. client.
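A client's reaction to CB_ATTR_CHANGED can be sketched as follows. The
helper names and the wanted_info policy set are hypothetical; the only
protocol fact relied on is that a bitmap4 is an array of 32-bit words in
which attribute n is bit (n mod 32) of word (n div 32). The attribute
numbers in the usage lines are arbitrary and do not correspond to any
particular FATTR4 value.

```python
def bitmap4_to_attrs(words):
    """Expand a bitmap4 (list of 32-bit words) into attribute numbers."""
    return [
        index * 32 + bit
        for index, word in enumerate(words)
        for bit in range(32)
        if (word >> bit) & 1
    ]


def attrs_to_fetch(acca_critical, acca_info, wanted_info=frozenset()):
    """Pick the attributes to re-query after a CB_ATTR_CHANGED callback.

    Every attribute set in acca_critical is fetched. Attributes set only
    in acca_info are fetched when the client's own policy (wanted_info)
    says it cares about them.
    """
    critical = bitmap4_to_attrs(acca_critical)
    optional = [
        attr
        for attr in bitmap4_to_attrs(acca_info)
        if attr in wanted_info and attr not in critical
    ]
    return critical + optional


# Attribute 4 is critical; attributes 1 and 5 are informational, and
# this client only wants to poll for attribute 1.
print(attrs_to_fetch([0x10], [0x22], wanted_info={1}))
```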
13.2. Operation 15: CB_COPY - Report results of a server-side copy 14.2. Operation 15: CB_COPY - Report results of a server-side copy
13.2.1. ARGUMENT 14.2.1. ARGUMENT
union copy_info4 switch (nfsstat4 cca_status) { union copy_info4 switch (nfsstat4 cca_status) {
case NFS4_OK: case NFS4_OK:
void; void;
default: default:
length4 cca_bytes_copied; length4 cca_bytes_copied;
}; };
struct CB_COPY4args { struct CB_COPY4args {
nfs_fh4 cca_fh; nfs_fh4 cca_fh;
stateid4 cca_stateid; stateid4 cca_stateid;
copy_info4 cca_copy_info; copy_info4 cca_copy_info;
}; };
13.2.2. RESULT 14.2.2. RESULT
struct CB_COPY4res { struct CB_COPY4res {
nfsstat4 ccr_status; nfsstat4 ccr_status;
}; };
13.2.3. DESCRIPTION 14.2.3. DESCRIPTION
CB_COPY is used for both intra- and inter-server asynchronous copies. CB_COPY is used for both intra- and inter-server asynchronous copies.
The CB_COPY callback informs the client of the result of an The CB_COPY callback informs the client of the result of an
asynchronous server-side copy. This operation is sent by the asynchronous server-side copy. This operation is sent by the
destination server to the client in a CB_COMPOUND request. The copy destination server to the client in a CB_COMPOUND request. The copy
is identified by the filehandle and stateid arguments. The result is is identified by the filehandle and stateid arguments. The result is
indicated by the status field. If the copy failed, cca_bytes_copied indicated by the status field. If the copy failed, cca_bytes_copied
contains the number of bytes copied before the failure occurred. The contains the number of bytes copied before the failure occurred. The
cca_bytes_copied value indicates the number of bytes copied but not cca_bytes_copied value indicates the number of bytes copied but not
which specific bytes have been copied. which specific bytes have been copied.
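How a client might interpret the copy_info4 union can be sketched as
below. The function name and the returned dictionary are hypothetical;
the sketch relies only on the union's shape (no payload when the status
is NFS4_OK, a byte count otherwise) and on NFS4_OK being 0.

```python
NFS4_OK = 0


def interpret_cb_copy(cca_status, cca_bytes_copied=None):
    """Summarize a CB_COPY callback for the code that issued the COPY."""
    if cca_status == NFS4_OK:
        # Success carries no byte count in copy_info4.
        return {"done": True, "status": cca_status}
    # On failure only the number of bytes copied is known, not which
    # bytes, so the client cannot assume any particular range of the
    # destination is valid.
    return {
        "done": False,
        "status": cca_status,
        "bytes_copied": cca_bytes_copied,
        "ranges_known": False,
    }


print(interpret_cb_copy(NFS4_OK))
print(interpret_cb_copy(5, cca_bytes_copied=4096))
```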
skipping to change at page 88, line 8 skipping to change at page 88, line 9
If the client supports the COPY operation, the client is REQUIRED to If the client supports the COPY operation, the client is REQUIRED to
support the CB_COPY operation. support the CB_COPY operation.
The CB_COPY operation may fail for the following reasons (this is a The CB_COPY operation may fail for the following reasons (this is a
partial list): partial list):
NFS4ERR_NOTSUPP: The copy offload operation is not supported by the NFS4ERR_NOTSUPP: The copy offload operation is not supported by the
NFS client receiving this request. NFS client receiving this request.
14. IANA Considerations 15. IANA Considerations
This section uses terms that are defined in [27]. This section uses terms that are defined in [24].
15. References 16. References
15.1. Normative References 16.1. Normative References
[1] Bradner, S., "Key words for use in RFCs to Indicate Requirement [1] Bradner, S., "Key words for use in RFCs to Indicate Requirement
Levels", March 1997. Levels", March 1997.
[2] Shepler, S., Eisler, M., and D. Noveck, "Network File System [2] Shepler, S., Eisler, M., and D. Noveck, "Network File System
(NFS) Version 4 Minor Version 1 Protocol", RFC 5661, (NFS) Version 4 Minor Version 1 Protocol", RFC 5661,
January 2010. January 2010.
[3] Haynes, T., "Network File System (NFS) Version 4 Minor Version [3] Haynes, T., "Network File System (NFS) Version 4 Minor Version
2 External Data Representation Standard (XDR) Description", 2 External Data Representation Standard (XDR) Description",
skipping to change at page 88, line 39 skipping to change at page 88, line 40
January 2005. January 2005.
[5] Haynes, T. and N. Williams, "Remote Procedure Call (RPC) [5] Haynes, T. and N. Williams, "Remote Procedure Call (RPC)
Security Version 3", draft-williams-rpcsecgssv3 (work in Security Version 3", draft-williams-rpcsecgssv3 (work in
progress), 2011. progress), 2011.
[6] The Open Group, "Section 'posix_fadvise()' of System Interfaces [6] The Open Group, "Section 'posix_fadvise()' of System Interfaces
of The Open Group Base Specifications Issue 6, IEEE Std 1003.1, of The Open Group Base Specifications Issue 6, IEEE Std 1003.1,
2004 Edition", 2004. 2004 Edition", 2004.
draft-ietf-nfsv4-minorversion2-07.txt:

[7] Eisler, M., Chiu, A., and L. Ling, "RPCSEC_GSS Protocol
Specification", RFC 2203, September 1997.

[8] Halevy, B., Welch, B., and J. Zelenka, "Object-Based Parallel
NFS (pNFS) Operations", RFC 5664, January 2010.

[9] Shepler, S., Eisler, M., and D. Noveck, "Network File System
(NFS) Version 4 Minor Version 1 External Data Representation
Standard (XDR) Description", RFC 5662, January 2010.

[10] Black, D., Glasgow, J., and S. Fridella, "Parallel NFS (pNFS)
Block/Volume Layout", RFC 5663, January 2010.

15.2. Informative References

draft-ietf-nfsv4-minorversion2-08.txt:

[7] Haynes, T., "Requirements for Labeled NFS",
draft-ietf-nfsv4-labreqs-00 (work in progress).

[8] Eisler, M., Chiu, A., and L. Ling, "RPCSEC_GSS Protocol
Specification", RFC 2203, September 1997.

[9] Halevy, B., Welch, B., and J. Zelenka, "Object-Based Parallel
NFS (pNFS) Operations", RFC 5664, January 2010.

16.2. Informative References
[11] Haynes, T. and D. Noveck, "Network File System (NFS) version 4 [10] Haynes, T. and D. Noveck, "Network File System (NFS) version 4
Protocol", draft-ietf-nfsv4-rfc3530bis-09 (Work In Progress), Protocol", draft-ietf-nfsv4-rfc3530bis-09 (Work In Progress),
March 2011. March 2011.
[12] Lentini, J., Everhart, C., Ellard, D., Tewari, R., and M. Naik, [11] Lentini, J., Everhart, C., Ellard, D., Tewari, R., and M. Naik,
"NSDB Protocol for Federated Filesystems", "NSDB Protocol for Federated Filesystems",
draft-ietf-nfsv4-federated-fs-protocol (Work In Progress), draft-ietf-nfsv4-federated-fs-protocol (Work In Progress),
2010. 2010.
[13] Lentini, J., Everhart, C., Ellard, D., Tewari, R., and M. Naik, [12] Lentini, J., Everhart, C., Ellard, D., Tewari, R., and M. Naik,
"Administration Protocol for Federated Filesystems", "Administration Protocol for Federated Filesystems",
draft-ietf-nfsv4-federated-fs-admin (Work In Progress), 2010. draft-ietf-nfsv4-federated-fs-admin (Work In Progress), 2010.
[14] Fielding, R., Gettys, J., Mogul, J., Frystyk, H., Masinter, L., [13] Fielding, R., Gettys, J., Mogul, J., Frystyk, H., Masinter, L.,
Leach, P., and T. Berners-Lee, "Hypertext Transfer Protocol -- Leach, P., and T. Berners-Lee, "Hypertext Transfer Protocol --
HTTP/1.1", RFC 2616, June 1999. HTTP/1.1", RFC 2616, June 1999.
[15] Postel, J. and J. Reynolds, "File Transfer Protocol", STD 9, [14] Postel, J. and J. Reynolds, "File Transfer Protocol", STD 9,
RFC 959, October 1985. RFC 959, October 1985.
[16] Simpson, W., "PPP Challenge Handshake Authentication Protocol [15] Simpson, W., "PPP Challenge Handshake Authentication Protocol
(CHAP)", RFC 1994, August 1996. (CHAP)", RFC 1994, August 1996.
[17] VanDeBogart, S., Frost, C., and E. Kohler, "Reducing Seek [16] VanDeBogart, S., Frost, C., and E. Kohler, "Reducing Seek
Overhead with Application-Directed Prefetching", Proceedings of Overhead with Application-Directed Prefetching", Proceedings of
USENIX Annual Technical Conference , June 2009. USENIX Annual Technical Conference , June 2009.
[18] Strohm, R., "Chapter 2, Data Blocks, Extents, and Segments, of [17] Strohm, R., "Chapter 2, Data Blocks, Extents, and Segments, of
Oracle Database Concepts 11g Release 1 (11.1)", January 2011. Oracle Database Concepts 11g Release 1 (11.1)", January 2011.
[19] Ashdown, L., "Chapter 15, Validating Database Files and [18] Ashdown, L., "Chapter 15, Validating Database Files and
Backups, of Oracle Database Backup and Recovery User's Guide Backups, of Oracle Database Backup and Recovery User's Guide
11g Release 1 (11.1)", August 2008. 11g Release 1 (11.1)", August 2008.
[20] McDougall, R. and J. Mauro, "Section 11.4.3, Detecting Memory [19] McDougall, R. and J. Mauro, "Section 11.4.3, Detecting Memory
Corruption of Solaris Internals", 2007. Corruption of Solaris Internals", 2007.
[21] Bairavasundaram, L., Goodson, G., Schroeder, B., Arpaci- [20] Bairavasundaram, L., Goodson, G., Schroeder, B., Arpaci-
Dusseau, A., and R. Arpaci-Dusseau, "An Analysis of Data Dusseau, A., and R. Arpaci-Dusseau, "An Analysis of Data
Corruption in the Storage Stack", Proceedings of the 6th USENIX Corruption in the Storage Stack", Proceedings of the 6th USENIX
Symposium on File and Storage Technologies (FAST '08) , 2008. Symposium on File and Storage Technologies (FAST '08) , 2008.
[22] "Section 46.6. Multi-Level Security (MLS) of Deployment Guide: [21] "Section 46.6. Multi-Level Security (MLS) of Deployment Guide:
Deployment, configuration and administration of Red Hat Deployment, configuration and administration of Red Hat
Enterprise Linux 5, Edition 6", 2011. Enterprise Linux 5, Edition 6", 2011.
[23] Quigley, D. and J. Lu, "Registry Specification for MAC Security [22] Quigley, D. and J. Lu, "Registry Specification for MAC Security
Label Formats", draft-quigley-label-format-registry (work in Label Formats", draft-quigley-label-format-registry (work in
progress), 2011. progress), 2011.
[24] Eisler, M., "XDR: External Data Representation Standard", [23] Eisler, M., "XDR: External Data Representation Standard",
RFC 4506, May 2006. RFC 4506, May 2006.
draft-ietf-nfsv4-minorversion2-07.txt:

[25] Wong, T. and J. Wilkes, "My cache or yours? Making storage more
exclusive", Proceedings of the USENIX Annual Technical
Conference, 2002.

[26] Muntz, D. and P. Honeyman, "Multi-level Caching in Distributed
File Systems", Proceedings of USENIX Annual Technical
Conference, 1992.

[27] Narten, T. and H. Alvestrand, "Guidelines for Writing an IANA
Considerations Section in RFCs", BCP 26, RFC 5226, May 2008.

[28] Nowicki, B., "NFS: Network File System Protocol specification",
RFC 1094, March 1989.

[29] Callaghan, B., Pawlowski, B., and P. Staubach, "NFS Version 3
Protocol Specification", RFC 1813, June 1995.

[30] Srinivasan, R., "Binding Protocols for ONC RPC Version 2",
RFC 1833, August 1995.

[31] Eisler, M., "NFS Version 2 and Version 3 Security Issues and
the NFS Protocol's Use of RPCSEC_GSS and Kerberos V5",
RFC 2623, June 1999.

[32] Callaghan, B., "NFS URL Scheme", RFC 2224, October 1997.

[33] Shepler, S., "NFS Version 4 Design Considerations", RFC 2624,
June 1999.

[34] Reynolds, J., "Assigned Numbers: RFC 1700 is Replaced by an On-
line Database", RFC 3232, January 2002.

[35] Linn, J., "The Kerberos Version 5 GSS-API Mechanism", RFC 1964,
June 1996.

[36] Shepler, S., Callaghan, B., Robinson, D., Thurlow, R., Beame,
C., Eisler, M., and D. Noveck, "Network File System (NFS)
version 4 Protocol", RFC 3530, April 2003.

draft-ietf-nfsv4-minorversion2-08.txt:

[24] Narten, T. and H. Alvestrand, "Guidelines for Writing an IANA
Considerations Section in RFCs", BCP 26, RFC 5226, May 2008.
Appendix A. Acknowledgments Appendix A. Acknowledgments
For the pNFS Access Permissions Check, the original draft was by For the pNFS Access Permissions Check, the original draft was by
Sorin Faibish, David Black, Mike Eisler, and Jason Glasgow. The work Sorin Faibish, David Black, Mike Eisler, and Jason Glasgow. The work
was influenced by discussions with Benny Halevy and Bruce Fields. A was influenced by discussions with Benny Halevy and Bruce Fields. A
review was done by Tom Haynes. review was done by Tom Haynes.
For the Sharing change attribute implementation details with NFSv4 For the Sharing change attribute implementation details with NFSv4
clients, the original draft was by Trond Myklebust. clients, the original draft was by Trond Myklebust.
 End of changes. 191 change blocks. 
597 lines changed or deleted 517 lines changed or added
