draft-ietf-nfsv4-minorversion2-26.txt   draft-ietf-nfsv4-minorversion2-27.txt 
NFSv4 T. Haynes NFSv4 T. Haynes
Internet-Draft Primary Data Internet-Draft Primary Data
Intended status: Standards Track May 19, 2014 Intended status: Standards Track September 20, 2014
Expires: November 20, 2014 Expires: March 24, 2015
NFS Version 4 Minor Version 2 NFS Version 4 Minor Version 2
draft-ietf-nfsv4-minorversion2-26.txt draft-ietf-nfsv4-minorversion2-27.txt
Abstract Abstract
This Internet-Draft describes NFS version 4 minor version two, This Internet-Draft describes NFS version 4 minor version two,
focusing mainly on the protocol extensions made from NFS version 4 focusing mainly on the protocol extensions made from NFS version 4
minor version 0 and NFS version 4 minor version 1. Major extensions minor version 0 and NFS version 4 minor version 1. Major extensions
introduced in NFS version 4 minor version two include: Server Side introduced in NFS version 4 minor version two include: Server Side
Copy, Application I/O Advise, Space Reservations, Sparse Files, Copy, Application I/O Advise, Space Reservations, Sparse Files,
Application Data Blocks, and Labeled NFS. Application Data Blocks, and Labeled NFS.
skipping to change at page 1, line 41 skipping to change at page 1, line 41
Internet-Drafts are working documents of the Internet Engineering Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF). Note that other groups may also distribute Task Force (IETF). Note that other groups may also distribute
working documents as Internet-Drafts. The list of current Internet- working documents as Internet-Drafts. The list of current Internet-
Drafts is at http://datatracker.ietf.org/drafts/current/. Drafts is at http://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress." material or to cite them other than as "work in progress."
This Internet-Draft will expire on November 20, 2014. This Internet-Draft will expire on March 24, 2015.
Copyright Notice Copyright Notice
Copyright (c) 2014 IETF Trust and the persons identified as the Copyright (c) 2014 IETF Trust and the persons identified as the
document authors. All rights reserved. document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents Provisions Relating to IETF Documents
(http://trustee.ietf.org/license-info) in effect on the date of (http://trustee.ietf.org/license-info) in effect on the date of
publication of this document. Please review these documents publication of this document. Please review these documents
skipping to change at page 2, line 20 skipping to change at page 2, line 20
described in the Simplified BSD License. described in the Simplified BSD License.
Table of Contents Table of Contents
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 4 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 4
1.1. The NFS Version 4 Minor Version 2 Protocol . . . . . . . 4 1.1. The NFS Version 4 Minor Version 2 Protocol . . . . . . . 4
1.2. Scope of This Document . . . . . . . . . . . . . . . . . 5 1.2. Scope of This Document . . . . . . . . . . . . . . . . . 5
1.3. NFSv4.2 Goals . . . . . . . . . . . . . . . . . . . . . . 5 1.3. NFSv4.2 Goals . . . . . . . . . . . . . . . . . . . . . . 5
1.4. Overview of NFSv4.2 Features . . . . . . . . . . . . . . 5 1.4. Overview of NFSv4.2 Features . . . . . . . . . . . . . . 5
1.4.1. Server Side Copy . . . . . . . . . . . . . . . . . . 5 1.4.1. Server Side Copy . . . . . . . . . . . . . . . . . . 5
1.4.2. Application I/O Advise . . . . . . . . . . . . . . . 5 1.4.2. Application I/O Advise . . . . . . . . . . . . . . . 6
1.4.3. Sparse Files . . . . . . . . . . . . . . . . . . . . 6 1.4.3. Sparse Files . . . . . . . . . . . . . . . . . . . . 6
1.4.4. Space Reservation . . . . . . . . . . . . . . . . . . 6 1.4.4. Space Reservation . . . . . . . . . . . . . . . . . . 6
1.4.5. Application Data Block (ADB) Support . . . . . . . . 6 1.4.5. Application Data Block (ADB) Support . . . . . . . . 6
1.4.6. Labeled NFS . . . . . . . . . . . . . . . . . . . . . 6 1.4.6. Labeled NFS . . . . . . . . . . . . . . . . . . . . . 6
1.5. Differences from NFSv4.1 . . . . . . . . . . . . . . . . 6 1.5. Differences from NFSv4.1 . . . . . . . . . . . . . . . . 7
2. Minor Versioning . . . . . . . . . . . . . . . . . . . . . . 7 2. Minor Versioning . . . . . . . . . . . . . . . . . . . . . . 7
3. pNFS considerations for New Operations . . . . . . . . . . . 10 3. pNFS considerations for New Operations . . . . . . . . . . . 10
3.1. Atomicity for ALLOCATE and DEALLOCATE . . . . . . . . . . 10 3.1. Atomicity for ALLOCATE and DEALLOCATE . . . . . . . . . . 10
3.2. Sharing of stateids with NFSv4.1 . . . . . . . . . . . . 10 3.2. Sharing of stateids with NFSv4.1 . . . . . . . . . . . . 11
3.3. NFSv4.2 as a Storage Protocol in pNFS: the File Layout 3.3. NFSv4.2 as a Storage Protocol in pNFS: the File Layout
Type . . . . . . . . . . . . . . . . . . . . . . . . . . 10 Type . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.3.1. Operations Sent to NFSv4.2 Data Servers . . . . . . . 11 3.3.1. Operations Sent to NFSv4.2 Data Servers . . . . . . . 11
4. Server Side Copy . . . . . . . . . . . . . . . . . . . . . . 11 4. Server Side Copy . . . . . . . . . . . . . . . . . . . . . . 11
4.1. Introduction . . . . . . . . . . . . . . . . . . . . . . 11 4.1. Introduction . . . . . . . . . . . . . . . . . . . . . . 11
4.2. Protocol Overview . . . . . . . . . . . . . . . . . . . . 11 4.2. Protocol Overview . . . . . . . . . . . . . . . . . . . . 12
4.2.1. Copy Operations . . . . . . . . . . . . . . . . . . . 12 4.2.1. Copy Operations . . . . . . . . . . . . . . . . . . . 12
4.2.2. Requirements for Operations . . . . . . . . . . . . . 12 4.2.2. Requirements for Operations . . . . . . . . . . . . . 13
4.3. Requirements for Inter-Server Copy . . . . . . . . . . . 13 4.3. Requirements for Inter-Server Copy . . . . . . . . . . . 13
4.4. Locking the Files . . . . . . . . . . . . . . . . . . . . 13 4.4. Implementation Considerations . . . . . . . . . . . . . . 14
4.4.1. Locking the Files . . . . . . . . . . . . . . . . . . 14
4.4.2. Client Caches . . . . . . . . . . . . . . . . . . . . 14
4.5. Intra-Server Copy . . . . . . . . . . . . . . . . . . . . 14 4.5. Intra-Server Copy . . . . . . . . . . . . . . . . . . . . 14
4.6. Inter-Server Copy . . . . . . . . . . . . . . . . . . . . 15 4.6. Inter-Server Copy . . . . . . . . . . . . . . . . . . . . 16
4.7. Server-to-Server Copy Protocol . . . . . . . . . . . . . 19 4.7. Server-to-Server Copy Protocol . . . . . . . . . . . . . 20
4.7.1. Considerations on Selecting a Copy Protocol . . . . . 19 4.7.1. Considerations on Selecting a Copy Protocol . . . . . 20
4.7.2. Using NFSv4.x as the Copy Protocol . . . . . . . . . 19 4.7.2. Using NFSv4.x as the Copy Protocol . . . . . . . . . 20
4.7.3. Using an Alternative Copy Protocol . . . . . . . . . 19 4.7.3. Using an Alternative Copy Protocol . . . . . . . . . 20
4.8. netloc4 - Network Locations . . . . . . . . . . . . . . . 20 4.8. netloc4 - Network Locations . . . . . . . . . . . . . . . 21
4.9. Copy Offload Stateids . . . . . . . . . . . . . . . . . . 21 4.9. Copy Offload Stateids . . . . . . . . . . . . . . . . . . 22
4.10. Security Considerations . . . . . . . . . . . . . . . . . 21 4.10. Security Considerations . . . . . . . . . . . . . . . . . 22
4.10.1. Inter-Server Copy Security . . . . . . . . . . . . . 21 4.10.1. Inter-Server Copy Security . . . . . . . . . . . . . 22
5. Support for Application IO Hints . . . . . . . . . . . . . . 31
6. Sparse Files . . . . . . . . . . . . . . . . . . . . . . . . 31 5. Support for Application IO Hints . . . . . . . . . . . . . . 32
6.1. Introduction . . . . . . . . . . . . . . . . . . . . . . 31 6. Sparse Files . . . . . . . . . . . . . . . . . . . . . . . . 32
6.2. Terminology . . . . . . . . . . . . . . . . . . . . . . . 32 6.1. Introduction . . . . . . . . . . . . . . . . . . . . . . 32
6.3. New Operations . . . . . . . . . . . . . . . . . . . . . 32 6.2. Terminology . . . . . . . . . . . . . . . . . . . . . . . 34
6.3.1. READ_PLUS . . . . . . . . . . . . . . . . . . . . . . 32 6.3. New Operations . . . . . . . . . . . . . . . . . . . . . 34
6.3.2. DEALLOCATE . . . . . . . . . . . . . . . . . . . . . 33 6.3.1. READ_PLUS . . . . . . . . . . . . . . . . . . . . . . 34
7. Space Reservation . . . . . . . . . . . . . . . . . . . . . . 33 6.3.2. DEALLOCATE . . . . . . . . . . . . . . . . . . . . . 34
7.1. Introduction . . . . . . . . . . . . . . . . . . . . . . 33 7. Space Reservation . . . . . . . . . . . . . . . . . . . . . . 34
8. Application Data Block Support . . . . . . . . . . . . . . . 35 7.1. Introduction . . . . . . . . . . . . . . . . . . . . . . 35
8.1. Generic Framework . . . . . . . . . . . . . . . . . . . . 36 8. Application Data Block Support . . . . . . . . . . . . . . . 37
8.1.1. Data Block Representation . . . . . . . . . . . . . . 36 8.1. Generic Framework . . . . . . . . . . . . . . . . . . . . 37
8.2. An Example of Detecting Corruption . . . . . . . . . . . 37 8.1.1. Data Block Representation . . . . . . . . . . . . . . 38
8.3. Example of READ_PLUS . . . . . . . . . . . . . . . . . . 38 8.2. An Example of Detecting Corruption . . . . . . . . . . . 38
8.4. An Example of Zeroing Space . . . . . . . . . . . . . . . 39 8.3. Example of READ_PLUS . . . . . . . . . . . . . . . . . . 40
9. Labeled NFS . . . . . . . . . . . . . . . . . . . . . . . . . 39 8.4. An Example of Zeroing Space . . . . . . . . . . . . . . . 41
9.1. Introduction . . . . . . . . . . . . . . . . . . . . . . 39 9. Labeled NFS . . . . . . . . . . . . . . . . . . . . . . . . . 41
9.2. Definitions . . . . . . . . . . . . . . . . . . . . . . . 40 9.1. Introduction . . . . . . . . . . . . . . . . . . . . . . 41
9.3. MAC Security Attribute . . . . . . . . . . . . . . . . . 41 9.2. Definitions . . . . . . . . . . . . . . . . . . . . . . . 42
9.3.1. Delegations . . . . . . . . . . . . . . . . . . . . . 41 9.3. MAC Security Attribute . . . . . . . . . . . . . . . . . 42
9.3.2. Permission Checking . . . . . . . . . . . . . . . . . 42 9.3.1. Delegations . . . . . . . . . . . . . . . . . . . . . 43
9.3.3. Object Creation . . . . . . . . . . . . . . . . . . . 42 9.3.2. Permission Checking . . . . . . . . . . . . . . . . . 43
9.3.4. Existing Objects . . . . . . . . . . . . . . . . . . 42 9.3.3. Object Creation . . . . . . . . . . . . . . . . . . . 44
9.3.5. Label Changes . . . . . . . . . . . . . . . . . . . . 42 9.3.4. Existing Objects . . . . . . . . . . . . . . . . . . 44
9.4. pNFS Considerations . . . . . . . . . . . . . . . . . . . 43 9.3.5. Label Changes . . . . . . . . . . . . . . . . . . . . 44
9.5. Discovery of Server Labeled NFS Support . . . . . . . . . 43 9.4. pNFS Considerations . . . . . . . . . . . . . . . . . . . 44
9.6. MAC Security NFS Modes of Operation . . . . . . . . . . . 43 9.5. Discovery of Server Labeled NFS Support . . . . . . . . . 45
9.6.1. Full Mode . . . . . . . . . . . . . . . . . . . . . . 43 9.6. MAC Security NFS Modes of Operation . . . . . . . . . . . 45
9.6.2. Guest Mode . . . . . . . . . . . . . . . . . . . . . 45 9.6.1. Full Mode . . . . . . . . . . . . . . . . . . . . . . 45
9.7. Security Considerations . . . . . . . . . . . . . . . . . 45 9.6.2. Guest Mode . . . . . . . . . . . . . . . . . . . . . 47
9.7. Security Considerations . . . . . . . . . . . . . . . . . 47
10. Sharing change attribute implementation details with NFSv4 10. Sharing change attribute implementation details with NFSv4
clients . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 clients . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
10.1. Introduction . . . . . . . . . . . . . . . . . . . . . . 46 10.1. Introduction . . . . . . . . . . . . . . . . . . . . . . 48
11. Security Considerations . . . . . . . . . . . . . . . . . . . 46 11. Security Considerations . . . . . . . . . . . . . . . . . . . 48
12. Error Values . . . . . . . . . . . . . . . . . . . . . . . . 46 12. Error Values . . . . . . . . . . . . . . . . . . . . . . . . 48
12.1. Error Definitions . . . . . . . . . . . . . . . . . . . 47 12.1. Error Definitions . . . . . . . . . . . . . . . . . . . 49
12.1.1. General Errors . . . . . . . . . . . . . . . . . . . 47 12.1.1. General Errors . . . . . . . . . . . . . . . . . . . 49
12.1.2. Server to Server Copy Errors . . . . . . . . . . . . 47 12.1.2. Server to Server Copy Errors . . . . . . . . . . . . 49
12.1.3. Labeled NFS Errors . . . . . . . . . . . . . . . . . 48 12.1.3. Labeled NFS Errors . . . . . . . . . . . . . . . . . 50
12.2. New Operations and Their Valid Errors . . . . . . . . . 48 12.2. New Operations and Their Valid Errors . . . . . . . . . 50
12.3. New Callback Operations and Their Valid Errors . . . . . 52 12.3. New Callback Operations and Their Valid Errors . . . . . 54
13. New File Attributes . . . . . . . . . . . . . . . . . . . . . 52 13. New File Attributes . . . . . . . . . . . . . . . . . . . . . 54
13.1. New RECOMMENDED Attributes - List and Definition 13.1. New RECOMMENDED Attributes - List and Definition
References . . . . . . . . . . . . . . . . . . . . . . . 52 References . . . . . . . . . . . . . . . . . . . . . . . 54
13.2. Attribute Definitions . . . . . . . . . . . . . . . . . 53 13.2. Attribute Definitions . . . . . . . . . . . . . . . . . 55
14. Operations: REQUIRED, RECOMMENDED, or OPTIONAL . . . . . . . 56 14. Operations: REQUIRED, RECOMMENDED, or OPTIONAL . . . . . . . 57
15. NFSv4.2 Operations . . . . . . . . . . . . . . . . . . . . . 59 15. Modifications to NFSv4.1 Operations . . . . . . . . . . . . . 61
15.1. Operation 59: ALLOCATE - Reserve Space in A Region of a 15.1. Operation 42: EXCHANGE_ID - Instantiate Client ID . . . 61
File . . . . . . . . . . . . . . . . . . . . . . . . . . 59 15.2. Operation 48: GETDEVICELIST - Get All Device Mappings
15.2. Operation 60: COPY - Initiate a server-side copy . . . . 60 for a File System . . . . . . . . . . . . . . . . . . . 62
15.3. Operation 61: COPY_NOTIFY - Notify a source server of a 16. NFSv4.2 Operations . . . . . . . . . . . . . . . . . . . . . 63
future copy . . . . . . . . . . . . . . . . . . . . . . 65 16.1. Operation 59: ALLOCATE - Reserve Space in A Region of a
15.4. Modification to Operation 42: EXCHANGE_ID - Instantiate File . . . . . . . . . . . . . . . . . . . . . . . . . . 63
Client ID . . . . . . . . . . . . . . . . . . . . . . . 66 16.2. Operation 60: COPY - Initiate a server-side copy . . . . 64
15.5. Operation 62: DEALLOCATE - Unreserve Space in a Region 16.3. Operation 61: COPY_NOTIFY - Notify a source server of a
of a File . . . . . . . . . . . . . . . . . . . . . . . 68 future copy . . . . . . . . . . . . . . . . . . . . . . 69
15.6. Operation 63: IO_ADVISE - Application I/O access pattern 16.4. Operation 62: DEALLOCATE - Unreserve Space in a Region
hints . . . . . . . . . . . . . . . . . . . . . . . . . 69 of a File . . . . . . . . . . . . . . . . . . . . . . . 70
15.7. Operation 64: LAYOUTERROR - Provide Errors for the 16.5. Operation 63: IO_ADVISE - Application I/O access pattern
Layout . . . . . . . . . . . . . . . . . . . . . . . . . 74 hints . . . . . . . . . . . . . . . . . . . . . . . . . 71
15.8. Operation 65: LAYOUTSTATS - Provide Statistics for the 16.6. Operation 64: LAYOUTERROR - Provide Errors for the
Layout . . . . . . . . . . . . . . . . . . . . . . . . . 77 Layout . . . . . . . . . . . . . . . . . . . . . . . . . 77
15.9. Operation 66: OFFLOAD_CANCEL - Stop an Offloaded 16.7. Operation 65: LAYOUTSTATS - Provide Statistics for the
Operation . . . . . . . . . . . . . . . . . . . . . . . 78 Layout . . . . . . . . . . . . . . . . . . . . . . . . . 80
15.10. Operation 67: OFFLOAD_STATUS - Poll for Status of 16.8. Operation 66: OFFLOAD_CANCEL - Stop an Offloaded
Asynchronous Operation . . . . . . . . . . . . . . . . . 79 Operation . . . . . . . . . . . . . . . . . . . . . . . 81
15.11. Operation 68: READ_PLUS - READ Data or Holes from a File 80 16.9. Operation 67: OFFLOAD_STATUS - Poll for Status of
15.12. Operation 69: SEEK - Find the Next Data or Hole . . . . 84 Asynchronous Operation . . . . . . . . . . . . . . . . . 82
15.13. Operation 70: WRITE_SAME - WRITE an ADB Multiple Times 16.10. Operation 68: READ_PLUS - READ Data or Holes from a File 83
to a File . . . . . . . . . . . . . . . . . . . . . . . 85 16.11. Operation 69: SEEK - Find the Next Data or Hole . . . . 88
16. NFSv4.2 Callback Operations . . . . . . . . . . . . . . . . . 89 16.12. Operation 70: WRITE_SAME - WRITE an ADB Multiple Times
16.1. Operation 15: CB_OFFLOAD - Report results of an to a File . . . . . . . . . . . . . . . . . . . . . . . 89
asynchronous operation . . . . . . . . . . . . . . . . . 89 17. NFSv4.2 Callback Operations . . . . . . . . . . . . . . . . . 92
17. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 90 17.1. Operation 15: CB_OFFLOAD - Report results of an
18. References . . . . . . . . . . . . . . . . . . . . . . . . . 90 asynchronous operation . . . . . . . . . . . . . . . . . 92
18.1. Normative References . . . . . . . . . . . . . . . . . . 90 18. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 94
18.2. Informative References . . . . . . . . . . . . . . . . . 91 19. References . . . . . . . . . . . . . . . . . . . . . . . . . 94
Appendix A. Acknowledgments . . . . . . . . . . . . . . . . . . 92 19.1. Normative References . . . . . . . . . . . . . . . . . . 94
Appendix B. RFC Editor Notes . . . . . . . . . . . . . . . . . . 93 19.2. Informative References . . . . . . . . . . . . . . . . . 94
Author's Address . . . . . . . . . . . . . . . . . . . . . . . . 93 Appendix A. Acknowledgments . . . . . . . . . . . . . . . . . . 96
Appendix B. RFC Editor Notes . . . . . . . . . . . . . . . . . . 97
Author's Address . . . . . . . . . . . . . . . . . . . . . . . . 97
1. Introduction 1. Introduction
1.1. The NFS Version 4 Minor Version 2 Protocol 1.1. The NFS Version 4 Minor Version 2 Protocol
The NFS version 4 minor version 2 (NFSv4.2) protocol is the third The NFS version 4 minor version 2 (NFSv4.2) protocol is the third
minor version of the NFS version 4 (NFSv4) protocol. The first minor minor version of the NFS version 4 (NFSv4) protocol. The first minor
version, NFSv4.0, is described in [I-D.ietf-nfsv4-rfc3530bis] and the version, NFSv4.0, is described in [I-D.ietf-nfsv4-rfc3530bis] and the
second minor version, NFSv4.1, is described in [RFC5661]. second minor version, NFSv4.1, is described in [RFC5661].
skipping to change at page 5, line 50 skipping to change at page 6, line 8
A traditional file copy from one server to another results in the A traditional file copy from one server to another results in the
data being put on the network twice - source to client and then data being put on the network twice - source to client and then
client to destination. New operations are introduced to allow the client to destination. New operations are introduced to allow the
client to authorize the two servers to interact directly. As this client to authorize the two servers to interact directly. As this
copy can be lengthy, asynchronous support is also provided. copy can be lengthy, asynchronous support is also provided.
1.4.2. Application I/O Advise 1.4.2. Application I/O Advise
Applications and clients want to advise the server as to expected I/O Applications and clients want to advise the server as to expected I/O
behavior. Using IO_ADVISE (see Section 15.6) to communicate future behavior. Using IO_ADVISE (see Section 16.5) to communicate future
I/O behavior such as whether a file will be accessed sequentially or I/O behavior such as whether a file will be accessed sequentially or
randomly, and whether a file will or will not be accessed in the near randomly, and whether a file will or will not be accessed in the near
future, allows servers to optimize future I/O requests for a file by, future, allows servers to optimize future I/O requests for a file by,
for example, prefetching or evicting data. This operation can be for example, prefetching or evicting data. This operation can be
used to support the posix_fadvise function as well as other used to support the posix_fadvise function as well as other
applications such as databases and video editors. applications such as databases and video editors.
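To illustrate the kind of application-side hint that IO_ADVISE is intended to carry, the following is a minimal sketch that issues a sequential-access hint through posix_fadvise as exposed by Python's os module. It shows only the local call. Whether and how an NFS client maps such a call onto IO_ADVISE is implementation-specific and is not shown here. The helper names advise_sequential and demo are hypothetical.

```python
import os
import tempfile


def advise_sequential(path):
    """Hint that `path` will be read sequentially from start to end.

    Returns True if the hint was issued, False if posix_fadvise is not
    available or the call is rejected. The hint is advisory only and
    never changes file contents.
    """
    if not hasattr(os, "posix_fadvise"):
        return False  # platform does not expose posix_fadvise
    fd = os.open(path, os.O_RDONLY)
    try:
        # offset=0 with len=0 covers the file from its start to its end.
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_SEQUENTIAL)
    except OSError:
        return False  # the underlying file system refused the hint
    finally:
        os.close(fd)
    return True


def demo():
    """Create a small scratch file, issue the hint, and clean up."""
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(b"x" * 4096)
        path = f.name
    try:
        return advise_sequential(path)
    finally:
        os.unlink(path)
```

Because the hint is advisory, a caller should behave identically whether the function returns True or False.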
1.4.3. Sparse Files 1.4.3. Sparse Files
Sparse files are ones which have unallocated or uninitialized data Sparse files are ones which have unallocated or uninitialized data
blocks as holes in the file. Such holes are typically transferred as blocks as holes in the file. Such holes are typically transferred as
0s during I/O. READ_PLUS (see Section 15.11) allows a server to send 0s during I/O. READ_PLUS (see Section 16.10) allows a server to send
back to the client metadata describing the hole and DEALLOCATE (see back to the client metadata describing the hole and DEALLOCATE (see
Section 15.5) allows the client to punch holes into a file. In Section 16.4) allows the client to punch holes into a file. In
addition, SEEK (see Section 15.12) is provided to scan for the next addition, SEEK (see Section 16.11) is provided to scan for the next
hole or data from a given location. hole or data from a given location.
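The "next hole or data" scan can be pictured with a small in-memory model. The sketch below is not the wire protocol: it assumes a file is described by a sorted list of data extents plus a file size, and the function name seek_next and its return conventions are hypothetical. In particular, this model reports None when nothing of the requested kind is found and does not treat end-of-file as a hole. The actual result encoding is defined by the SEEK operation itself.

```python
def seek_next(data_extents, file_size, offset, want):
    """Return the first offset >= `offset` that lies in data or in a hole.

    data_extents: sorted, non-overlapping (start, length) pairs of data.
    want: "data" or "hole".
    Returns None if nothing of the requested kind exists at or after
    `offset` (a modelling choice of this sketch).
    """
    if want not in ("data", "hole"):
        raise ValueError("want must be 'data' or 'hole'")
    if offset >= file_size:
        return None
    pos = offset
    for start, length in data_extents:
        end = start + length
        if end <= pos:
            continue  # extent lies entirely before the current position
        if want == "data":
            return max(pos, start)
        if pos < start:
            return pos  # position is already inside a hole
        pos = end  # position is inside data: skip to the end of the extent
    if want == "data":
        return None
    return pos if pos < file_size else None
```

For a 16-byte file with data at offsets 0-3 and 8-11, a scan for a hole from offset 0 lands at offset 4, and a scan for data from offset 5 lands at offset 8.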
1.4.4. Space Reservation 1.4.4. Space Reservation
When a file is sparse, one concern applications have is ensuring that When a file is sparse, one concern applications have is ensuring that
there will always be enough data blocks available for the file during there will always be enough data blocks available for the file during
future writes. ALLOCATE (see Section 15.1) allows a client to future writes. ALLOCATE (see Section 16.1) allows a client to
request a guarantee that space will be available. And DEALLOCATE request a guarantee that space will be available. And DEALLOCATE
(see Section 15.5) allows the client to punch a hole into a file, (see Section 16.4) allows the client to punch a hole into a file,
thus releasing a space reservation. thus releasing a space reservation.
1.4.5. Application Data Block (ADB) Support 1.4.5. Application Data Block (ADB) Support
Some applications treat a file as if it were a disk and as such want Some applications treat a file as if it were a disk and as such want
to initialize (or format) the file image. We introduce WRITE_SAME to initialize (or format) the file image. We introduce WRITE_SAME
(see Section 15.13) to send this metadata to the server to allow it (see Section 16.12) to send this metadata to the server to allow it
to write the block contents. to write the block contents.
1.4.6. Labeled NFS 1.4.6. Labeled NFS
While both clients and servers can employ Mandatory Access Control While both clients and servers can employ Mandatory Access Control
(MAC) security models to enforce data access, there has been no (MAC) security models to enforce data access, there has been no
protocol support to allow full interoperability. A new file object protocol support for interoperability. A new file object attribute,
attribute, sec_label (see Section 13.2.2) allows for the server to sec_label (see Section 13.2.2) allows for the server to store MAC
store and enforce MAC labels. The format of the sec_label labels on files, which the client retrieves and uses to enforce data
accommodates any MAC security system. access (see Section 9.6.2). The format of the sec_label accommodates
any MAC security system.
1.5. Differences from NFSv4.1 1.5. Differences from NFSv4.1
In NFSv4.1, the only way to introduce new variants of an operation In NFSv4.1, the only way to introduce new variants of an operation
was to introduce a new operation. For example, READ becomes either READ2 or was to introduce a new operation. For example, READ becomes either READ2 or
READ_PLUS. With the use of discriminated unions as parameters to READ_PLUS. With the use of discriminated unions as parameters to
such functions in NFSv4.2, it is possible to add a new arm in a such functions in NFSv4.2, it is possible to add a new arm in a
subsequent minor version. And it is also possible to move such an subsequent minor version. And it is also possible to move such an
operation from OPTIONAL/RECOMMENDED to REQUIRED. Forcing an operation from OPTIONAL/RECOMMENDED to REQUIRED. Forcing an
implementation to adopt each arm of a discriminated union at such a implementation to adopt each arm of a discriminated union at such a
skipping to change at page 10, line 33 skipping to change at page 10, line 42
16. Unless explicitly documented in a minor version standard's 16. Unless explicitly documented in a minor version standard's
document, a client MUST NOT attempt to use a stateid, document, a client MUST NOT attempt to use a stateid,
filehandle, or similar returned object from the COMPOUND filehandle, or similar returned object from the COMPOUND
procedure with minor version X for another COMPOUND procedure procedure with minor version X for another COMPOUND procedure
with minor version Y, where X != Y. with minor version Y, where X != Y.
3. pNFS considerations for New Operations 3. pNFS considerations for New Operations
3.1. Atomicity for ALLOCATE and DEALLOCATE 3.1. Atomicity for ALLOCATE and DEALLOCATE
Both ALLOCATE (see Section 15.1) and DEALLOCATE (see Section 15.5) Both ALLOCATE (see Section 16.1) and DEALLOCATE (see Section 16.4)
are sent to the metadata server, which is responsible for are sent to the metadata server, which is responsible for
coordinating the changes onto the storage devices. In particular, coordinating the changes onto the storage devices. In particular,
both operations must either fully succeed or fail; it cannot be the both operations must either fully succeed or fail; it cannot be the
case that one storage device succeeds whilst another fails. case that one storage device succeeds whilst another fails.
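The text above does not say how a metadata server achieves this all-or-nothing behavior. As one possible illustration, the sketch below applies a change device by device and reverts the devices already changed as soon as one fails. The function apply_atomically and its apply/undo callbacks are hypothetical, and a real server would also have to cope with a failing undo, crashes, and concurrent requests, none of which are modelled here.

```python
def apply_atomically(devices, apply, undo):
    """Apply a change to every device, or leave every device unchanged.

    apply(device) performs the change and raises an exception on failure.
    undo(device) reverts a change that was previously applied.
    Returns True if all devices succeeded, False if the change was
    rolled back.
    """
    done = []
    for device in devices:
        try:
            apply(device)
        except Exception:
            # Revert in reverse order so no device keeps a partial result.
            for finished in reversed(done):
                undo(finished)
            return False
        done.append(device)
    return True
```

With this shape, the caller sees a single success or failure for the whole operation, which matches the requirement that no storage device succeeds while another fails.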
3.2. Sharing of stateids with NFSv4.1 3.2. Sharing of stateids with NFSv4.1
An NFSv4.2 metadata server can hand out a layout to an NFSv4.1 storage An NFSv4.2 metadata server can hand out a layout to an NFSv4.1 storage
device. Section 13.9.1 of [RFC5661] discusses how the client gets a device. Section 13.9.1 of [RFC5661] discusses how the client gets a
stateid from the metadata server to present to a storage device. stateid from the metadata server to present to a storage device.
skipping to change at page 11, line 11 skipping to change at page 11, line 26
NFSv4.2, in which case the rules in Section 3.3.1 apply. As the File NFSv4.2, in which case the rules in Section 3.3.1 apply. As the File
Layout Type does not provide a means for informing the client as to Layout Type does not provide a means for informing the client as to
which minor version a particular storage device is providing, it will which minor version a particular storage device is providing, it will
have to negotiate this via the normal RPC semantics of major and have to negotiate this via the normal RPC semantics of major and
minor version discovery. minor version discovery.
3.3.1. Operations Sent to NFSv4.2 Data Servers 3.3.1. Operations Sent to NFSv4.2 Data Servers
In addition to the commands listed in [RFC5661], NFSv4.2 data servers In addition to the commands listed in [RFC5661], NFSv4.2 data servers
MAY accept a COMPOUND containing the following additional operations: MAY accept a COMPOUND containing the following additional operations:
IO_ADVISE (see Section 15.6), READ_PLUS (see Section 15.11), IO_ADVISE (see Section 16.5), READ_PLUS (see Section 16.10),
WRITE_SAME (see Section 15.13), and SEEK (see Section 15.12), which WRITE_SAME (see Section 16.12), and SEEK (see Section 16.11), which
will be treated like the subset specified as "Operations Sent to will be treated like the subset specified as "Operations Sent to
NFSv4.1 Data Servers" in Section 13.6 of [RFC5661]. NFSv4.1 Data Servers" in Section 13.6 of [RFC5661].
Additional details on the implementation of these operations in a Additional details on the implementation of these operations in a
pNFS context are documented in the operation specific sections. pNFS context are documented in the operation specific sections.
4. Server Side Copy 4. Server Side Copy
4.1. Introduction 4.1. Introduction
skipping to change at page 12, line 7 skipping to change at page 12, line 24
Throughout the rest of this document, we refer to the NFS server Throughout the rest of this document, we refer to the NFS server
containing the source file as the "source server" and the NFS server containing the source file as the "source server" and the NFS server
to which the file is transferred as the "destination server". In the to which the file is transferred as the "destination server". In the
case of an intra-server copy, the source server and destination case of an intra-server copy, the source server and destination
server are the same server. Therefore in the context of an intra- server are the same server. Therefore in the context of an intra-
server copy, the terms source server and destination server refer to server copy, the terms source server and destination server refer to
the single server performing the copy. the single server performing the copy.
The new operations are designed to copy files. Other file system The new operations are designed to copy files. Other file system
objects can be copied by building on these operations or using other objects can be copied by building on these operations or using other
techniques. For example if the user wishes to copy a directory, the techniques. For example, if the user wishes to copy a directory, the
client can synthesize a directory copy by first creating the client can synthesize a directory copy by first creating the
destination directory and then copying the source directory's files destination directory and then copying the source directory's files
to the new destination directory. to the new destination directory.
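The directory-copy synthesis described above can be sketched with local file system calls. In this illustration the copy_file callback stands in for a per-file COPY request and defaults to shutil.copyfile purely so the example runs locally. The names copy_directory and demo are hypothetical, and the sketch copies only the regular files directly inside the source directory.

```python
import os
import shutil
import tempfile


def copy_directory(src_dir, dst_dir, copy_file=shutil.copyfile):
    """Create dst_dir, then copy each regular file in src_dir into it.

    copy_file(src_path, dst_path) stands in for a per-file copy request.
    Subdirectories and other object types are skipped in this sketch.
    Returns the sorted list of file names that were copied.
    """
    os.makedirs(dst_dir, exist_ok=True)  # step 1: create the destination
    copied = []
    for name in sorted(os.listdir(src_dir)):
        src_path = os.path.join(src_dir, name)
        if os.path.isfile(src_path):
            # step 2: copy each source file into the new directory
            copy_file(src_path, os.path.join(dst_dir, name))
            copied.append(name)
    return copied


def demo():
    """Copy a scratch directory holding two files and one subdirectory."""
    with tempfile.TemporaryDirectory() as root:
        src = os.path.join(root, "src")
        os.makedirs(os.path.join(src, "subdir"))
        for name in ("a.txt", "b.txt"):
            with open(os.path.join(src, name), "w") as f:
                f.write(name)
        dst = os.path.join(root, "dst")
        copied = copy_directory(src, dst)
        return copied, sorted(os.listdir(dst))
```

A recursive variant would apply the same two steps to each subdirectory in turn.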
For the inter-server copy, the operations are defined to be For the inter-server copy, the operations are defined to be
compatible with the traditional copy authentication approach. The compatible with the traditional copy authentication approach. The
client and user are authorized at the source for reading. Then they client and user are authorized at the source for reading. Then they
are authorized at the destination for writing. are authorized at the destination for writing.
4.2.1. Copy Operations 4.2.1. Copy Operations
COPY_NOTIFY: Used by the client to notify the source server of a COPY_NOTIFY: Used by the client to notify the source server of a
future file copy from a given destination server for the given future file copy from a given destination server for the given
user. (Section 15.3) user. (Section 16.3)
COPY: Used by the client to request a file copy. (Section 15.2) COPY: Used by the client to request a file copy. (Section 16.2)
OFFLOAD_CANCEL: Used by the client to terminate an asynchronous file OFFLOAD_CANCEL: Used by the client to terminate an asynchronous file
copy. (Section 15.9) copy. (Section 16.8)
OFFLOAD_STATUS: Used by the client to poll the status of an OFFLOAD_STATUS: Used by the client to poll the status of an
asynchronous file copy. (Section 15.10) asynchronous file copy. (Section 16.9)
CB_OFFLOAD: Used by the destination server to report the results of CB_OFFLOAD: Used by the destination server to report the results of
an asynchronous file copy to the client. (Section 16.1) an asynchronous file copy to the client. (Section 17.1)
4.2.2. Requirements for Operations 4.2.2. Requirements for Operations
The implementation of server-side copy is OPTIONAL by the client and The implementation of server-side copy is OPTIONAL by the client and
the server. However, in order to successfully copy a file, some the server. However, in order to successfully copy a file, some
operations MUST be supported by the client and/or server. operations MUST be supported by the client and/or server.
If a client desires an intra-server file copy, then it MUST support If a client desires an intra-server file copy, then it MUST support
the COPY and CB_OFFLOAD operations. If COPY returns a stateid, then the COPY and CB_OFFLOAD operations. If COPY returns a stateid, then
the client MAY use the OFFLOAD_CANCEL and OFFLOAD_STATUS operations. the client MAY use the OFFLOAD_CANCEL and OFFLOAD_STATUS operations.
skipping to change at page 13, line 46 skipping to change at page 14, line 17
destination first have a "copying relationship" increases the destination first have a "copying relationship" increases the
administrative burden. However, the specification MUST NOT administrative burden. However, the specification MUST NOT
preclude implementations that require preconfiguration. preclude implementations that require preconfiguration.
o The specification MUST NOT mandate a trust relationship between o The specification MUST NOT mandate a trust relationship between
the source and destination server. The NFSv4 security model the source and destination server. The NFSv4 security model
requires mutual authentication between a principal on an NFS requires mutual authentication between a principal on an NFS
client and a principal on an NFS server. This model MUST continue client and a principal on an NFS server. This model MUST continue
with the introduction of COPY. with the introduction of COPY.
4.4. Locking the Files 4.4. Implementation Considerations
4.4.1. Locking the Files
Both the source and destination file may need to be locked to protect Both the source and destination file may need to be locked to protect
the content during the copy operations. A client can achieve this by the content during the copy operations. A client can achieve this by
a combination of OPEN and LOCK operations. I.e., either share or a combination of OPEN and LOCK operations. I.e., either share or
byte range locks might be desired. byte range locks might be desired.
Note that when the client establishes a lock stateid on the source,
the context of that stateid is for the client and not the
destination. As such, there might already be an outstanding stateid,
issued to the destination as client of the source, with the same
value as that provided for the lock stateid. The source MUST equate
the lock stateid as that of the client, i.e., when the destination
presents it in the context of an inter-server copy, it is on behalf of
the client.
4.4.2. Client Caches
In a traditional copy, if the client is in the process of writing to
the file before the copy (and perhaps with a write delegation), it
will be straightforward to update the destination server. With an
inter-server copy, the source has no insight into the changes cached
on the client. The client SHOULD write back the data to the source
or be prepared for the destination to get a corrupt copy of the file.
4.5. Intra-Server Copy 4.5. Intra-Server Copy
To copy a file on a single server, the client uses a COPY operation. To copy a file on a single server, the client uses a COPY operation.
The server may respond to the copy operation with the final results The server may respond to the copy operation with the final results
of the copy or it may perform the copy asynchronously and deliver the of the copy or it may perform the copy asynchronously and deliver the
results using a CB_OFFLOAD operation callback. If the copy is results using a CB_OFFLOAD operation callback. If the copy is
performed asynchronously, the client may poll the status of the copy performed asynchronously, the client may poll the status of the copy
using OFFLOAD_STATUS or cancel the copy using OFFLOAD_CANCEL. using OFFLOAD_STATUS or cancel the copy using OFFLOAD_CANCEL.
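The two outcomes described above can be sketched from the client's point of view. This is an illustrative simulation only: FakeServer, CopyReply, and client_copy are hypothetical names, not protocol structures or a real NFS API. In the protocol the final result of an asynchronous copy arrives through CB_OFFLOAD; the sketch shows only the optional OFFLOAD_STATUS polling path.

```python
# Hypothetical simulation of a client driving an intra-server COPY.
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class CopyReply:
    # stateid is None when the server completed the copy synchronously.
    stateid: Optional[int]
    bytes_copied: int


class FakeServer:
    """Stand-in for a server; every name here is an assumption."""

    def __init__(self, asynchronous: bool, file_size: int, step: int = 4096):
        self.asynchronous = asynchronous
        self.file_size = file_size
        self.step = step
        self.progress = 0

    def copy(self) -> CopyReply:
        if self.asynchronous:
            # Asynchronous case: hand back a stateid for later polling.
            return CopyReply(stateid=1, bytes_copied=0)
        self.progress = self.file_size
        return CopyReply(stateid=None, bytes_copied=self.file_size)

    def offload_status(self, stateid: int) -> Tuple[int, bool]:
        # Each poll advances the simulated copy by one step.
        self.progress = min(self.file_size, self.progress + self.step)
        return self.progress, self.progress == self.file_size


def client_copy(server: FakeServer, max_polls: int = 100) -> int:
    """Return the number of bytes copied, polling if the copy is async."""
    reply = server.copy()
    if reply.stateid is None:
        return reply.bytes_copied
    for _ in range(max_polls):
        copied, done = server.offload_status(reply.stateid)
        if done:
            return copied
    raise TimeoutError("copy did not finish; a client could cancel here")
```

A client that gives up waiting would, in place of the TimeoutError, send OFFLOAD_CANCEL with the same stateid.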
A synchronous intra-server copy is shown in Figure 1. In this A synchronous intra-server copy is shown in Figure 1. In this
skipping to change at page 28, line 30 skipping to change at page 29, line 30
Note that the use of the "copy_confirm_auth" privilege accomplishes Note that the use of the "copy_confirm_auth" privilege accomplishes
the following: the following:
o If a protocol like NFS is being used, with export policies, export o If a protocol like NFS is being used, with export policies, export
policies can be overridden in case the destination server as-an- policies can be overridden in case the destination server as-an-
NFS-client is not authorized NFS-client is not authorized
o Manual configuration to allow a copy relationship between the o Manual configuration to allow a copy relationship between the
source and destination is not needed. source and destination is not needed.
4.10.1.1.4. Finishing or Stopping a Secure Inter-Server Copy 4.10.1.1.4. Maintaining a Secure Inter-Server Copy
The secure inter-server copy depends upon both the source server and
the destination server keeping the copy_from_auth and copy_to_auth
RPCSEC_GSS3 context handles valid during the copy. The client SHOULD
use the copy_from_auth RPCSEC_GSS3 context handle for the NFSv4 lease
renewing operation to the source server, and the copy_to_auth
RPCSEC_GSS3 context handle for the NFSv4 lease renewing operation to
the destination server during the copy to periodically determine the
continued validity of the respective GSS3 handles. A periodic RPC
NULL call can also be used for this purpose.
If the client determines that either handle becomes invalid during a
copy, then the copy MUST be aborted by the client sending an
OFFLOAD_CANCEL to both the source and destination servers and
destroying the respective copy related context handles as described
in Section 4.10.1.1.5.
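The client-side check described above might be sketched as follows. This is a minimal illustration, assuming the lease renewal (or RPC NULL call) made with each handle is wrapped in a callable that returns True while the handle is still valid; the function name and the action strings are hypothetical.

```python
# Hypothetical sketch: decide what a client does after probing both handles.
from typing import Callable, Dict, List


def check_copy_handles(renew: Dict[str, Callable[[], bool]]) -> List[str]:
    """Return the actions the client takes after one validity check.

    renew["source"] is assumed to renew the lease using copy_from_auth,
    renew["destination"] to renew the lease using copy_to_auth.
    """
    source_ok = renew["source"]()
    destination_ok = renew["destination"]()
    if source_ok and destination_ok:
        return []  # both handles still valid, the copy continues
    # Either handle failing aborts the copy on both servers.
    return [
        "OFFLOAD_CANCEL to source",
        "OFFLOAD_CANCEL to destination",
        "destroy copy_from_auth handle",
        "destroy copy_to_auth handle",
    ]
```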
4.10.1.1.5. Finishing or Stopping a Secure Inter-Server Copy
Under normal operation, the client MUST destroy the copy_from_auth Under normal operation, the client MUST destroy the copy_from_auth
and the copy_to_auth RPCSEC_GSSv3 handle once the COPY operation and the copy_to_auth RPCSEC_GSSv3 handle once the COPY operation
returns for a synchronous inter-server copy or a CB_OFFLOAD reports returns for a synchronous inter-server copy or a CB_OFFLOAD reports
the result of an asynchronous copy. the result of an asynchronous copy.
The copy_confirm_auth privilege and compound authentication The copy_confirm_auth privilege and compound authentication
RPCSEC_GSSv3 handle is constructed from information held by the RPCSEC_GSSv3 handle is constructed from information held by the
copy_to_auth privilege, and MUST be destroyed by the destination copy_to_auth privilege, and MUST be destroyed by the destination
server (via an RPCSEC_GSS3_DESTROY call) when the copy_to_auth server (via an RPCSEC_GSS3_DESTROY call) when the copy_to_auth
RPCSEC_GSSv3 handle is destroyed. RPCSEC_GSSv3 handle is destroyed.
The copy_confirm_auth RPCSEC_GSS3 handle is associated with a
copy_from_auth RPCSEC_GSS3 handle on the source server via the shared
secret and MUST be locally destroyed (there is no RPCSEC_GSS3_DESTROY
as the source server is not the initiator) when the copy_from_auth
RPCSEC_GSSv3 handle is destroyed.
If the client sends an OFFLOAD_CANCEL to the source server to rescind If the client sends an OFFLOAD_CANCEL to the source server to rescind
the destination server's synchronous copy privilege, it uses the the destination server's synchronous copy privilege, it uses the
privileged "copy_from_auth" RPCSEC_GSSv3 handle and the privileged "copy_from_auth" RPCSEC_GSSv3 handle and the
cra_destination_server in OFFLOAD_CANCEL MUST be the same as the name cra_destination_server in OFFLOAD_CANCEL MUST be the same as the name
of the destination server specified in copy_from_auth_priv. The of the destination server specified in copy_from_auth_priv. The
source server will then delete the <"copy_from_auth", user id, source server will then delete the <"copy_from_auth", user id,
destination> privilege and fail any subsequent copy requests sent destination> privilege and fail any subsequent copy requests sent
under the auspices of this privilege from the destination server. under the auspices of this privilege from the destination server.
The client MUST destroy both the "copy_from_auth" and the The client MUST destroy both the "copy_from_auth" and the
"copy_to_auth" RPCSEC_GSSv3 handles. "copy_to_auth" RPCSEC_GSSv3 handles.
If the client sends an OFFLOAD_STATUS to the destination server to If the client sends an OFFLOAD_STATUS to the destination server to
check on the status of an asynchronous copy, it uses the privileged check on the status of an asynchronous copy, it uses the privileged
"copy_to_auth" RPCSEC_GSSv3 handle and the osa_stateid in "copy_to_auth" RPCSEC_GSSv3 handle and the osa_stateid in
OFFLOAD_STATUS MUST be the same as the wr_callback_id specified in OFFLOAD_STATUS MUST be the same as the wr_callback_id specified in
the "copy_to_auth" privilege stored on the destination server. the "copy_to_auth" privilege stored on the destination server.
If the client sends an OFFLOAD_CANCEL to the destination server to If the client sends an OFFLOAD_CANCEL to the destination server to
skipping to change at page 31, line 11 skipping to change at page 32, line 36
The same techniques as Section 4.10.1.2, using unique URLs for each The same techniques as Section 4.10.1.2, using unique URLs for each
destination server, can be used for other protocols (e.g., HTTP destination server, can be used for other protocols (e.g., HTTP
[RFC2616] and FTP [RFC959]) as well. [RFC2616] and FTP [RFC959]) as well.
5. Support for Application IO Hints 5. Support for Application IO Hints
Applications can issue client I/O hints via posix_fadvise() Applications can issue client I/O hints via posix_fadvise()
[posix_fadvise] to the NFS client. While this can help the NFS [posix_fadvise] to the NFS client. While this can help the NFS
client optimize I/O and caching for a file, it does not allow the NFS client optimize I/O and caching for a file, it does not allow the NFS
server and its exported file system to do likewise. We add an server and its exported file system to do likewise. We add an
IO_ADVISE procedure (Section 15.6) to communicate the client file IO_ADVISE procedure (Section 16.5) to communicate the client file
access patterns to the NFS server. The NFS server upon receiving an access patterns to the NFS server. The NFS server upon receiving an
IO_ADVISE operation MAY choose to alter its I/O and caching behavior, IO_ADVISE operation MAY choose to alter its I/O and caching behavior,
but is under no obligation to do so. but is under no obligation to do so.
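For reference, the local hint that an application issues today can be sketched with Python's os.posix_fadvise wrapper. This shows only the client-side call; how an NFS client would map such a hint onto IO_ADVISE is not shown and is not implied by this sketch. The helper name hint_sequential is hypothetical.

```python
# Sketch of an application issuing a sequential-access hint on a local fd.
import os


def hint_sequential(path: str) -> bool:
    """Advise the kernel that the whole file will be read sequentially.

    Returns False on platforms where posix_fadvise is not available.
    """
    if not hasattr(os, "posix_fadvise"):
        return False
    fd = os.open(path, os.O_RDONLY)
    try:
        # offset 0 and length 0 cover the file from start to end.
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_SEQUENTIAL)
    finally:
        os.close(fd)
    return True
```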
Application specific NFS clients such as those used by hypervisors Application specific NFS clients such as those used by hypervisors
and databases can also leverage application hints to communicate and databases can also leverage application hints to communicate
their specialized requirements. their specialized requirements.
6. Sparse Files 6. Sparse Files
skipping to change at page 32, line 15 skipping to change at page 33, line 40
metadata and 100M in the data. metadata and 100M in the data.
No new operation is needed to allow the creation of a sparsely No new operation is needed to allow the creation of a sparsely
populated file; when a file is created and a write occurs past the populated file; when a file is created and a write occurs past the
current size of the file, the non-allocated region will either be a current size of the file, the non-allocated region will either be a
hole or filled with zeros. The choice of behavior is dictated by the hole or filled with zeros. The choice of behavior is dictated by the
underlying file system and is transparent to the application. What underlying file system and is transparent to the application. What
is needed are the abilities to read sparse files and to punch holes is needed are the abilities to read sparse files and to punch holes
to reinitialize the contents of a file. to reinitialize the contents of a file.
Two new operations DEALLOCATE (Section 15.5) and READ_PLUS Two new operations DEALLOCATE (Section 16.4) and READ_PLUS
(Section 15.11) are introduced. DEALLOCATE allows for the hole (Section 16.10) are introduced. DEALLOCATE allows for the hole
punching. I.e., an application might want to reset the allocation punching. I.e., an application might want to reset the allocation
and reservation status of a range of the file. READ_PLUS supports and reservation status of a range of the file. READ_PLUS supports
all the features of READ but includes an extension to support sparse all the features of READ but includes an extension to support sparse
files. READ_PLUS is guaranteed to perform no worse than READ, and files. READ_PLUS is guaranteed to perform no worse than READ, and
can dramatically improve performance with sparse files. READ_PLUS can dramatically improve performance with sparse files. READ_PLUS
does not depend on pNFS protocol features, but can be used by pNFS to does not depend on pNFS protocol features, but can be used by pNFS to
support sparse files. support sparse files.
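The idea of a read that returns an array of data and hole results can be illustrated with a small sketch. It is an assumption-laden simplification: a real server would consult the file system's allocation information, whereas this sketch treats any all-zero block as a hole. The function name, the tuple layout, and the 4096-byte block size are hypothetical.

```python
# Hypothetical sketch: describe a byte range as data and hole segments.
from typing import List, Tuple

Segment = Tuple[str, int, int]  # (kind, offset, length)


def read_plus_segments(content: bytes, block_size: int = 4096) -> List[Segment]:
    """Split content into ("data" | "hole", offset, length) segments."""
    segments: List[Segment] = []
    for offset in range(0, len(content), block_size):
        block = content[offset:offset + block_size]
        # Zero detection stands in for real allocation metadata.
        kind = "hole" if block.count(0) == len(block) else "data"
        if segments and segments[-1][0] == kind:
            # Merge with the previous segment of the same kind.
            _, prev_offset, prev_length = segments[-1]
            segments[-1] = (kind, prev_offset, prev_length + len(block))
        else:
            segments.append((kind, offset, len(block)))
    return segments
```

Only the "data" segments would need to carry file bytes over the wire; a hole is fully described by its offset and length.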
6.2. Terminology 6.2. Terminology
skipping to change at page 33, line 48 skipping to change at page 35, line 31
machine from continuing execution and result in downtime. machine from continuing execution and result in downtime.
Currently, in order to achieve such a guarantee, applications zero Currently, in order to achieve such a guarantee, applications zero
the entire file. The initial zeroing allocates the backing blocks the entire file. The initial zeroing allocates the backing blocks
and all subsequent writes are overwrites of already allocated blocks. and all subsequent writes are overwrites of already allocated blocks.
This approach is not only inefficient in terms of the amount of I/O This approach is not only inefficient in terms of the amount of I/O
done, it is also not guaranteed to work on file systems that are log done, it is also not guaranteed to work on file systems that are log
structured or deduplicated. An efficient way of guaranteeing space structured or deduplicated. An efficient way of guaranteeing space
reservation would be beneficial to such applications. reservation would be beneficial to such applications.
The new ALLOCATE operation (see Section 15.1) allows a client to The new ALLOCATE operation (see Section 16.1) allows a client to
request a guarantee that space will be available. The ALLOCATE request a guarantee that space will be available. The ALLOCATE
operation guarantees that any future writes to the region it was operation guarantees that any future writes to the region it was
successfully called for will not fail with NFS4ERR_NOSPC. successfully called for will not fail with NFS4ERR_NOSPC.
Another useful feature is the ability to report the number of blocks Another useful feature is the ability to report the number of blocks
that would be freed when a file is deleted. Currently, NFS reports that would be freed when a file is deleted. Currently, NFS reports
two size attributes: two size attributes:
size The logical file size of the file. size The logical file size of the file.
skipping to change at page 35, line 8 skipping to change at page 36, line 40
reporting of the space utilization. reporting of the space utilization.
For example, two files A and B have 10 blocks each. Let 6 of these For example, two files A and B have 10 blocks each. Let 6 of these
blocks be shared between them. Thus, the combined space utilized by blocks be shared between them. Thus, the combined space utilized by
the two files is 14 * BLOCK_SIZE bytes. In the former case, the the two files is 14 * BLOCK_SIZE bytes. In the former case, the
combined space utilization of the two files would be reported as 20 * combined space utilization of the two files would be reported as 20 *
BLOCK_SIZE. However, deleting either would only result in 4 * BLOCK_SIZE. However, deleting either would only result in 4 *
BLOCK_SIZE being freed. Conversely, the latter interpretation would BLOCK_SIZE being freed. Conversely, the latter interpretation would
report that the space utilization is only 8 * BLOCK_SIZE. report that the space utilization is only 8 * BLOCK_SIZE.
Adding another size attribute, space_freed (see Section 13.2.4), is Adding another size attribute, space_freed (see Section 13.2.3), is
helpful in solving this problem. space_freed is the number of blocks helpful in solving this problem. space_freed is the number of blocks
that are allocated to the given file that would be freed on its that are allocated to the given file that would be freed on its
deletion. In the example, both A and B would report space_freed as 4 deletion. In the example, both A and B would report space_freed as 4
* BLOCK_SIZE and space_used as 10 * BLOCK_SIZE. If A is deleted, B * BLOCK_SIZE and space_used as 10 * BLOCK_SIZE. If A is deleted, B
will report space_freed as 10 * BLOCK_SIZE as the deletion of B would will report space_freed as 10 * BLOCK_SIZE as the deletion of B would
result in the deallocation of all 10 blocks. result in the deallocation of all 10 blocks.
The addition of this attribute does not solve the problem of space The addition of this attribute does not solve the problem of space
being over-reported. However, over-reporting is better than under- being over-reported. However, over-reporting is better than under-
reporting. reporting.
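The arithmetic of the A and B example can be checked with a short sketch. The block numbering, the BLOCK_SIZE value, and the function name are assumptions made for illustration; each file is modeled as a set of block identifiers, and a block counts toward space_freed only if no other file references it.

```python
# Hypothetical model: each file is the set of block ids it references.
from typing import Dict, Set, Tuple

BLOCK_SIZE = 4096  # assumed value, any block size gives the same ratios


def space_report(files: Dict[str, Set[int]]) -> Dict[str, Tuple[int, int]]:
    """Return {name: (space_used, space_freed)} in bytes."""
    report = {}
    for name, blocks in files.items():
        shared: Set[int] = set()
        for other, other_blocks in files.items():
            if other != name:
                shared |= blocks & other_blocks
        space_used = len(blocks) * BLOCK_SIZE
        # Only blocks private to this file are released on deletion.
        space_freed = (len(blocks) - len(shared)) * BLOCK_SIZE
        report[name] = (space_used, space_freed)
    return report


# A and B hold 10 blocks each and share 6 of them (blocks 4 through 9).
both = space_report({"A": set(range(0, 10)), "B": set(range(4, 14))})
# After A is deleted, all 10 of B's blocks are private to B.
only_b = space_report({"B": set(range(4, 14))})
```

Summing space_used over A and B gives 20 * BLOCK_SIZE, which over-reports the 14 * BLOCK_SIZE actually consumed, matching the remark above.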
skipping to change at page 38, line 35 skipping to change at page 40, line 19
occur in the transport layer. The server and client components are occur in the transport layer. The server and client components are
totally unaware of the file format and might report everything as totally unaware of the file format and might report everything as
being transferred correctly even in the case the application detects being transferred correctly even in the case the application detects
corruption. corruption.
8.3. Example of READ_PLUS 8.3. Example of READ_PLUS
The hypothetical application presented in Section 8.2 can be used to The hypothetical application presented in Section 8.2 can be used to
illustrate how READ_PLUS would return an array of results. A file is illustrate how READ_PLUS would return an array of results. A file is
created and initialized with 100 4k ADBs in the FREE state with the created and initialized with 100 4k ADBs in the FREE state with the
WRITE_SAME operation (see Section 15.13): WRITE_SAME operation (see Section 16.12):
WRITE_SAME {0, 4k, 100, 0, 0, 8, 0xfeedface} WRITE_SAME {0, 4k, 100, 0, 0, 8, 0xfeedface}
Further, assume the application writes a single ADB at 16k, changing Further, assume the application writes a single ADB at 16k, changing
the guard pattern to 0xcafedead, we would then have in memory: the guard pattern to 0xcafedead, we would then have in memory:
0k -> (4k - 1) : 00 00 00 00 fe ed fa ce 00 00 ... 00 00 0k -> (4k - 1) : 00 00 00 00 fe ed fa ce 00 00 ... 00 00
4k -> (8k - 1) : 00 00 00 01 fe ed fa ce 00 00 ... 00 00 4k -> (8k - 1) : 00 00 00 01 fe ed fa ce 00 00 ... 00 00
8k -> (12k - 1) : 00 00 00 02 fe ed fa ce 00 00 ... 00 00 8k -> (12k - 1) : 00 00 00 02 fe ed fa ce 00 00 ... 00 00
12k -> (16k - 1) : 00 00 00 03 fe ed fa ce 00 00 ... 00 00 12k -> (16k - 1) : 00 00 00 03 fe ed fa ce 00 00 ... 00 00
skipping to change at page 39, line 45 skipping to change at page 41, line 32
models base their access control decisions on the label on the models base their access control decisions on the label on the
subject (usually a process) and the object it wishes to access subject (usually a process) and the object it wishes to access
[RFC7204]. These labels may contain user identity information but [RFC7204]. These labels may contain user identity information but
usually contain additional information. In DAC systems users are usually contain additional information. In DAC systems users are
free to specify the access rules for resources that they own. MAC free to specify the access rules for resources that they own. MAC
models base their security decisions on a system wide policy models base their security decisions on a system wide policy
established by an administrator or organization which the users do established by an administrator or organization which the users do
not have the ability to override. In this section, we add a MAC not have the ability to override. In this section, we add a MAC
model to NFSv4.2. model to NFSv4.2.
The first change necessary is to devise a method for transporting and First we provide a method for transporting and storing security label
storing security label data on NFSv4 file objects. Security labels data on NFSv4 file objects. Security labels have several semantics
have several semantics that are met by NFSv4 recommended attributes that are met by NFSv4 recommended attributes such as the ability to
such as the ability to set the label value upon object creation. set the label value upon object creation. Access control on these
Access control on these attributes is done through a combination of attributes is done through a combination of two mechanisms. As with
two mechanisms. As with other recommended attributes on file objects other recommended attributes on file objects the usual DAC checks
the usual DAC checks (ACLs and permission bits) will be performed to (ACLs and permission bits) will be performed to ensure that proper
ensure that proper file ownership is enforced. In addition a MAC file ownership is enforced. In addition a MAC system MAY be employed
system MAY be employed on the client, server, or both to enforce on the client, server, or both to enforce additional policy on what
additional policy on what subjects may modify security label subjects may modify security label information.
information.
The second change is to provide methods for the client to determine Second, we describe a method for the client to determine if an NFSv4
if the security label has changed. A client which needs to know if a file object security label has changed. A client which needs to know
label is going to change SHOULD request a delegation on that file. if a label on a file or set of files is going to change SHOULD
In order to change the security label, the server will have to recall request a delegation on each labeled file. In order to change such a
all delegations. This will inform the client of the change. security label, the server will have to recall delegations on any
file affected by the label change, so informing clients of the label
change.
An additional useful change would be modification to the RPC layer An additional useful feature would be modification to the RPC layer
used in NFSv4 to allow RPC calls to carry security labels. Such used by NFSv4 to allow RPC calls to carry security labels and enable
full mode enforcement as described in Section 9.6.1. Such
modifications are outside the scope of this document (see modifications are outside the scope of this document (see
[rpcsec_gssv3]). [rpcsec_gssv3]).
9.2. Definitions 9.2. Definitions
Label Format Specifier (LFS): is an identifier used by the client to Label Format Specifier (LFS): is an identifier used by the client to
establish the syntactic format of the security label and the establish the syntactic format of the security label and the
semantic meaning of its components. These specifiers exist in a semantic meaning of its components. These specifiers exist in a
registry associated with documents describing the format and registry associated with documents describing the format and
semantics of the label. semantics of the label.
Label Format Registry: is the IANA registry containing all Label Format Registry: is the IANA registry (see [Quigley14])
registered LFS along with references to the documents that containing all registered LFSes along with references to the
describe the syntactic format and semantics of the security label. documents that describe the syntactic format and semantics of the
security label.
Policy Identifier (PI): is an optional part of the definition of a Policy Identifier (PI): is an optional part of the definition of a
Label Format Specifier which allows for clients and server to Label Format Specifier which allows for clients and server to
identify specific security policies. identify specific security policies.
Object: is a passive resource within the system that we wish to be Object: is a passive resource within the system that we wish to be
protected. Objects can be entities such as files, directories, protected. Objects can be entities such as files, directories,
pipes, sockets, and many other system resources relevant to the pipes, sockets, and many other system resources relevant to the
protection of the system state. protection of the system state.
skipping to change at page 40, line 51 skipping to change at page 42, line 41
access to an object. access to an object.
MAC-Aware: is a server which can transmit and store object labels. MAC-Aware: is a server which can transmit and store object labels.
MAC-Functional: is a client or server which is Labeled NFS enabled. MAC-Functional: is a client or server which is Labeled NFS enabled.
Such a system can interpret labels and apply policies based on the Such a system can interpret labels and apply policies based on the
security system. security system.
Multi-Level Security (MLS): is a traditional model where objects are Multi-Level Security (MLS): is a traditional model where objects are
given a sensitivity level (Unclassified, Secret, Top Secret, etc) given a sensitivity level (Unclassified, Secret, Top Secret, etc)
and a category set [MLS]. and a category set (see [BL73], [RFC1108], and [RFC2401]).
9.3. MAC Security Attribute 9.3. MAC Security Attribute
MAC models base access decisions on security attributes bound to MAC models base access decisions on security attributes bound to
subjects and objects. This information can range from a user subjects and objects. This information can range from a user
identity for an identity based MAC model, sensitivity levels for identity for an identity based MAC model, sensitivity levels for
Multi-level security, or a type for Type Enforcement. These models Multi-level security, or a type for Type Enforcement. These models
base their decisions on different criteria but the semantics of the base their decisions on different criteria but the semantics of the
security attribute remain the same. The semantics required by the security attribute remain the same. The semantics required by the
security attributes are listed below: security attributes are listed below:
skipping to change at page 42, line 40 skipping to change at page 44, line 28
9.3.4. Existing Objects 9.3.4. Existing Objects
Note that under the MAC model, all objects must have labels. Note that under the MAC model, all objects must have labels.
Therefore, if an existing server is upgraded to include Labeled NFS Therefore, if an existing server is upgraded to include Labeled NFS
support, then it is the responsibility of the security system to support, then it is the responsibility of the security system to
define the behavior for existing objects. define the behavior for existing objects.
9.3.5. Label Changes 9.3.5. Label Changes
If there are open delegations on the file belonging to client other Consider a guest mode system (Section 9.6.2) in which the clients
than the one making the label change, then the process described in enforce MAC checks and the server has only a DAC security system
Section 9.3.1 must be followed. In short, the delegation will be which stores the labels along with the file data. In this type of
recalled, which effectively notifies the client of the change. system, a user with the appropriate DAC credentials on a client with
poorly configured or disabled MAC labeling enforcement is allowed
access to the file label (and data) on the server and can change the
label.
Consider a system in which the clients enforce MAC checks and the Clients which need to know if a label on a file or set of files has
server has a very simple security system which just stores the changed SHOULD request a delegation on each labeled file so that a
labels. In this system, the MAC label check always allows access, label change by another client will be known via the process
regardless of the subject label. described in Section 9.3.1 which must be followed: the delegation
will be recalled, which effectively notifies the client of the
change.
The way in which MAC labels are enforced is by the client. The Note that the MAC security policies on a client can be such that the
security policies on the client can be such that the client does not client does not have access to the file unless it has a delegation.
have access to the file unless it has a delegation. The recall of
the delegation will force the client to flush any cached content of
the file.
9.4. pNFS Considerations 9.4. pNFS Considerations
The new FATTR4_SEC_LABEL attribute is metadata information and as The new FATTR4_SEC_LABEL attribute is metadata information and as
such the DS is not aware of the value contained on the MDS. such the DS is not aware of the value contained on the MDS.
Fortunately, the NFSv4.1 protocol [RFC5661] already has provisions Fortunately, the NFSv4.1 protocol [RFC5661] already has provisions
for doing access level checks from the DS to the MDS. In order for for doing access level checks from the DS to the MDS. In order for
the DS to validate the subject label presented by the client, it the DS to validate the subject label presented by the client, it
SHOULD utilize this mechanism. SHOULD utilize this mechanism.
skipping to change at page 44, line 26 skipping to change at page 46, line 20
client to make a decision as to the acceptable security attributes to client to make a decision as to the acceptable security attributes to
create a file with before sending the request to the server. Once create a file with before sending the request to the server. Once
the server receives the creation request from the client it may the server receives the creation request from the client it may
choose to evaluate if the security attribute is acceptable. choose to evaluate if the security attribute is acceptable.
Security attributes on the client and server may vary based on MAC Security attributes on the client and server may vary based on MAC
model and policy. To handle this the security attribute field has an model and policy. To handle this the security attribute field has an
LFS component. This component is a mechanism for the host to LFS component. This component is a mechanism for the host to
identify the format and meaning of the opaque portion of the security identify the format and meaning of the opaque portion of the security
attribute. A full mode environment may contain hosts operating in attribute. A full mode environment may contain hosts operating in
several different LFSs. In this case a mechanism for translating the several different LFSes. In this case a mechanism for translating
opaque portion of the security attribute is needed. The actual the opaque portion of the security attribute is needed. The actual
translation function will vary based on MAC model and policy and is translation function will vary based on MAC model and policy and is
out of the scope of this document. If a translation is unavailable out of the scope of this document. If a translation is unavailable
for a given LFS then the request MUST be denied. Another recourse is for a given LFS then the request MUST be denied. Another recourse is
to allow the host to provide a fallback mapping for unknown security to allow the host to provide a fallback mapping for unknown security
attributes. attributes.
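
The translation rule above can be sketched as follows. This is a minimal illustration only: the translator table, the LabelDenied exception, and the function name translate_label are hypothetical and are not defined by this document, and the actual translation functions depend on the MAC model and policy.

```python
from typing import Callable, Dict, Optional


class LabelDenied(Exception):
    """Raised when a label from an unknown LFS cannot be translated."""


# Hypothetical translator type: opaque label bytes in, local label bytes out.
Translator = Callable[[bytes], bytes]


def translate_label(lfs: int,
                    opaque: bytes,
                    translators: Dict[int, Translator],
                    fallback: Optional[bytes] = None) -> bytes:
    """Translate the opaque portion of a security attribute.

    If no translator is registered for the LFS, use the host-provided
    fallback mapping when there is one; otherwise deny the request.
    """
    translator = translators.get(lfs)
    if translator is not None:
        return translator(opaque)
    if fallback is not None:
        return fallback
    raise LabelDenied("no translation available for LFS %d" % lfs)
```

A host without a fallback mapping simply lets LabelDenied propagate and rejects the request.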
9.6.1.2. Policy Enforcement 9.6.1.2. Policy Enforcement
In full mode access control decisions are made by both the clients In full mode access control decisions are made by both the clients
and servers. When a client makes a request it takes the security and servers. When a client makes a request it takes the security
skipping to change at page 50, line 12 skipping to change at page 52, line 12
| | NFS4ERR_INVAL, NFS4ERR_IO, NFS4ERR_ISDIR, | | | NFS4ERR_INVAL, NFS4ERR_IO, NFS4ERR_ISDIR, |
| | NFS4ERR_MOVED, NFS4ERR_NOFILEHANDLE, | | | NFS4ERR_MOVED, NFS4ERR_NOFILEHANDLE, |
| | NFS4ERR_NOTSUPP, NFS4ERR_OLD_STATEID, | | | NFS4ERR_NOTSUPP, NFS4ERR_OLD_STATEID, |
| | NFS4ERR_OPENMODE, NFS4ERR_OP_NOT_IN_SESSION, | | | NFS4ERR_OPENMODE, NFS4ERR_OP_NOT_IN_SESSION, |
| | NFS4ERR_REP_TOO_BIG, | | | NFS4ERR_REP_TOO_BIG, |
| | NFS4ERR_REP_TOO_BIG_TO_CACHE, | | | NFS4ERR_REP_TOO_BIG_TO_CACHE, |
| | NFS4ERR_REQ_TOO_BIG, NFS4ERR_RETRY_UNCACHED_REP, | | | NFS4ERR_REQ_TOO_BIG, NFS4ERR_RETRY_UNCACHED_REP, |
| | NFS4ERR_ROFS, NFS4ERR_SERVERFAULT, | | | NFS4ERR_ROFS, NFS4ERR_SERVERFAULT, |
| | NFS4ERR_STALE, NFS4ERR_SYMLINK, | | | NFS4ERR_STALE, NFS4ERR_SYMLINK, |
| | NFS4ERR_TOO_MANY_OPS, NFS4ERR_WRONG_TYPE | | | NFS4ERR_TOO_MANY_OPS, NFS4ERR_WRONG_TYPE |
| GETDEVICELIST | NFS4ERR_NOTSUPP |
| LAYOUTERROR | NFS4ERR_ADMIN_REVOKED, NFS4ERR_BADXDR, | | LAYOUTERROR | NFS4ERR_ADMIN_REVOKED, NFS4ERR_BADXDR, |
| | NFS4ERR_BAD_STATEID, NFS4ERR_DEADSESSION, | | | NFS4ERR_BAD_STATEID, NFS4ERR_DEADSESSION, |
| | NFS4ERR_DELAY, NFS4ERR_DELEG_REVOKED, | | | NFS4ERR_DELAY, NFS4ERR_DELEG_REVOKED, |
| | NFS4ERR_EXPIRED, NFS4ERR_FHEXPIRED, | | | NFS4ERR_EXPIRED, NFS4ERR_FHEXPIRED, |
| | NFS4ERR_GRACE, NFS4ERR_INVAL, NFS4ERR_ISDIR, | | | NFS4ERR_GRACE, NFS4ERR_INVAL, NFS4ERR_ISDIR, |
| | NFS4ERR_MOVED, NFS4ERR_NOFILEHANDLE, | | | NFS4ERR_MOVED, NFS4ERR_NOFILEHANDLE, |
| | NFS4ERR_NOTSUPP, NFS4ERR_NO_GRACE, | | | NFS4ERR_NOTSUPP, NFS4ERR_NO_GRACE, |
| | NFS4ERR_OLD_STATEID, NFS4ERR_OP_NOT_IN_SESSION, | | | NFS4ERR_OLD_STATEID, NFS4ERR_OP_NOT_IN_SESSION, |
| | NFS4ERR_REP_TOO_BIG, | | | NFS4ERR_REP_TOO_BIG, |
| | NFS4ERR_REP_TOO_BIG_TO_CACHE, | | | NFS4ERR_REP_TOO_BIG_TO_CACHE, |
skipping to change at page 53, line 26 skipping to change at page 55, line 28
W means write-only (SETATTR may set, GETATTR may not retrieve). W means write-only (SETATTR may set, GETATTR may not retrieve).
R W means read/write (GETATTR may retrieve, SETATTR may set). R W means read/write (GETATTR may retrieve, SETATTR may set).
Defined in: The section of this specification that describes the Defined in: The section of this specification that describes the
attribute. attribute.
+------------------+----+-------------------+-----+----------------+ +------------------+----+-------------------+-----+----------------+
| Name | Id | Data Type | Acc | Defined in | | Name | Id | Data Type | Acc | Defined in |
+------------------+----+-------------------+-----+----------------+ +------------------+----+-------------------+-----+----------------+
| space_freed | 77 | length4 | R | Section 13.2.4 | | space_freed | 77 | length4 | R | Section 13.2.3 |
| change_attr_type | 78 | change_attr_type4 | R | Section 13.2.1 | | change_attr_type | 78 | change_attr_type4 | R | Section 13.2.1 |
| sec_label | 79 | sec_label4 | R W | Section 13.2.2 | | sec_label | 79 | sec_label4 | R W | Section 13.2.2 |
+------------------+----+-------------------+-----+----------------+ +------------------+----+-------------------+-----+----------------+
Table 4 Table 4
13.2. Attribute Definitions 13.2. Attribute Definitions
13.2.1. Attribute 78: change_attr_type 13.2.1. Attribute 78: change_attr_type
skipping to change at page 55, line 39 skipping to change at page 57, line 41
attribute. This component is dependent on the MAC model to interpret attribute. This component is dependent on the MAC model to interpret
and enforce. and enforce.
In particular, it is the responsibility of the LFS specification to In particular, it is the responsibility of the LFS specification to
define a maximum size for the opaque section, slai_data<>. When define a maximum size for the opaque section, slai_data<>. When
creating or modifying a label for an object, the client needs to be creating or modifying a label for an object, the client needs to be
guaranteed that the server will accept a label that is sized guaranteed that the server will accept a label that is sized
correctly. By both client and server being part of a specific MAC correctly. By both client and server being part of a specific MAC
model, the client will be aware of the size. model, the client will be aware of the size.
If a server supports sec_label, then it MUST also support 13.2.3. Attribute 77: space_freed
change_sec_label. Any modification to sec_label MUST modify the
value for change_sec_label.
13.2.3. Attribute 79: change_sec_label
The change_sec_label attribute is a read-only attribute per file. If
the value of sec_label for a file is not the same at two disparate
times then the values of change_sec_label at those times MUST be
different as well. The value of change_sec_label MAY change at other
times as well, but this should be rare, as that will require the
client to abort any operation in progress, re-read the label, and
retry the operation. As the sec_label is not bounded by size, this
attribute allows for VERIFY and NVERIFY to quickly determine if the
sec_label has been modified.
13.2.4. Attribute 77: space_freed
space_freed gives the number of bytes freed if the file is deleted. space_freed gives the number of bytes freed if the file is deleted.
This attribute is read only and is of type length4. It is a per file This attribute is read only and is of type length4. It is a per file
attribute. attribute.
14. Operations: REQUIRED, RECOMMENDED, or OPTIONAL 14. Operations: REQUIRED, RECOMMENDED, or OPTIONAL
The following tables summarize the operations of the NFSv4.2 protocol The following tables summarize the operations of the NFSv4.2 protocol
and the corresponding designation of REQUIRED, RECOMMENDED, and and the corresponding designation of REQUIRED, RECOMMENDED, and
OPTIONAL to implement or either OBSOLESCENT or MUST NOT implement. OPTIONAL to implement or either OBSOLESCENT or MUST NOT implement.
skipping to change at page 58, line 10 skipping to change at page 59, line 46
| CREATE_SESSION | REQ | | | CREATE_SESSION | REQ | |
| DELEGPURGE | OPT | FDELG (REQ) | | DELEGPURGE | OPT | FDELG (REQ) |
| DELEGRETURN | OPT | FDELG, DDELG, pNFS | | DELEGRETURN | OPT | FDELG, DDELG, pNFS |
| | | (REQ) | | | | (REQ) |
| DESTROY_CLIENTID | REQ | | | DESTROY_CLIENTID | REQ | |
| DESTROY_SESSION | REQ | | | DESTROY_SESSION | REQ | |
| EXCHANGE_ID | REQ | | | EXCHANGE_ID | REQ | |
| FREE_STATEID | REQ | | | FREE_STATEID | REQ | |
| GETATTR | REQ | | | GETATTR | REQ | |
| GETDEVICEINFO | OPT | pNFS (REQ) | | GETDEVICEINFO | OPT | pNFS (REQ) |
| GETDEVICELIST | OPT | pNFS (OPT) | | GETDEVICELIST | MNI | pNFS (MNI) |
| GETFH | REQ | | | GETFH | REQ | |
| GET_DIR_DELEGATION | OPT | DDELG (REQ) | | GET_DIR_DELEGATION | OPT | DDELG (REQ) |
| LAYOUTCOMMIT | OPT | pNFS (REQ) | | LAYOUTCOMMIT | OPT | pNFS (REQ) |
| LAYOUTGET | OPT | pNFS (REQ) | | LAYOUTGET | OPT | pNFS (REQ) |
| LAYOUTRETURN | OPT | pNFS (REQ) | | LAYOUTRETURN | OPT | pNFS (REQ) |
| LAYOUTERROR | OPT | pNFS (OPT) | | LAYOUTERROR | OPT | pNFS (OPT) |
| LAYOUTSTATS | OPT | pNFS (OPT) | | LAYOUTSTATS | OPT | pNFS (OPT) |
| LINK | OPT | | | LINK | OPT | |
| LOCK | REQ | | | LOCK | REQ | |
| LOCKT | REQ | | | LOCKT | REQ | |
skipping to change at page 59, line 37 skipping to change at page 61, line 29
| CB_RECALL_ANY | OPT | FDELG, DDELG, pNFS | | CB_RECALL_ANY | OPT | FDELG, DDELG, pNFS |
| | | (REQ) | | | | (REQ) |
| CB_RECALL_SLOT | REQ | | | CB_RECALL_SLOT | REQ | |
| CB_RECALLABLE_OBJ_AVAIL | OPT | DDELG, pNFS (REQ) | | CB_RECALLABLE_OBJ_AVAIL | OPT | DDELG, pNFS (REQ) |
| CB_SEQUENCE | OPT | FDELG, DDELG, pNFS | | CB_SEQUENCE | OPT | FDELG, DDELG, pNFS |
| | | (REQ) | | | | (REQ) |
| CB_WANTS_CANCELLED | OPT | FDELG, DDELG, pNFS | | CB_WANTS_CANCELLED | OPT | FDELG, DDELG, pNFS |
| | | (REQ) | | | | (REQ) |
+-------------------------+-------------------+---------------------+ +-------------------------+-------------------+---------------------+
15. NFSv4.2 Operations 15. Modifications to NFSv4.1 Operations
15.1. Operation 59: ALLOCATE - Reserve Space in A Region of a File 15.1. Operation 42: EXCHANGE_ID - Instantiate Client ID
15.1.1. ARGUMENT 15.1.1. ARGUMENT
/* new */
const EXCHGID4_FLAG_SUPP_FENCE_OPS = 0x00000004;
15.1.2. RESULT
Unchanged
15.1.3. MOTIVATION
Enterprise applications require guarantees that an operation has
either aborted or completed. NFSv4.1 provides this guarantee as long
as the session is alive: simply send a SEQUENCE operation on the same
slot with a new sequence number, and the successful return of
SEQUENCE indicates the previous operation has completed. However, if
the session is lost, there is no way to know when any in progress
operations have aborted or completed. In hindsight, the NFSv4.1
specification should have mandated that DESTROY_SESSION either abort
or complete all outstanding operations.
15.1.4. DESCRIPTION
A client SHOULD request the EXCHGID4_FLAG_SUPP_FENCE_OPS capability
when it sends an EXCHANGE_ID operation. The server SHOULD set this
capability in the EXCHANGE_ID reply whether the client requests it or
not. It is the server's return that determines whether this
capability is in effect. When it is in effect, the following will
occur:
o The server will not reply to any DESTROY_SESSION invoked with the
client ID until all operations in progress are completed or
aborted.
o The server will not reply to subsequent EXCHANGE_ID invoked on the
same client owner with a new verifier until all operations in
progress on the client ID's session are completed or aborted.
o The NFS server SHOULD support client ID trunking, and if it does
and the EXCHGID4_FLAG_SUPP_FENCE_OPS capability is enabled, then a
session ID created on one node of the storage cluster MUST be
destroyable via DESTROY_SESSION. In addition, DESTROY_CLIENTID
and an EXCHANGE_ID with a new verifier affect all sessions
regardless of which node the sessions were created on.
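
As a small sketch of the flag handling described above: the constant value is taken from the ARGUMENT section, while the function names are illustrative assumptions. The point shown is that the flag in the server's reply, not the client's request, determines whether the capability is in effect.

```python
# Value from the ARGUMENT section of this operation.
EXCHGID4_FLAG_SUPP_FENCE_OPS = 0x00000004


def request_flags(base_flags: int) -> int:
    """Flags a client might send: its usual flags plus the fence-ops request."""
    return base_flags | EXCHGID4_FLAG_SUPP_FENCE_OPS


def fence_ops_in_effect(reply_flags: int) -> bool:
    """True if the server's EXCHANGE_ID reply sets the capability."""
    return bool(reply_flags & EXCHGID4_FLAG_SUPP_FENCE_OPS)
```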
15.2. Operation 48: GETDEVICELIST - Get All Device Mappings for a File
System
15.2.1. ARGUMENT
struct GETDEVICELIST4args {
/* CURRENT_FH: object belonging to the file system */
layouttype4 gdla_layout_type;
/* number of deviceIDs to return */
count4 gdla_maxdevices;
nfs_cookie4 gdla_cookie;
verifier4 gdla_cookieverf;
};
15.2.2. RESULT
struct GETDEVICELIST4resok {
nfs_cookie4 gdlr_cookie;
verifier4 gdlr_cookieverf;
deviceid4 gdlr_deviceid_list<>;
bool gdlr_eof;
};
union GETDEVICELIST4res switch (nfsstat4 gdlr_status) {
case NFS4_OK:
GETDEVICELIST4resok gdlr_resok4;
default:
void;
};
15.2.3. MOTIVATION
The GETDEVICELIST operation was introduced in [RFC5661] specifically to
request a list of devices at filesystem mount time from block layout
type servers. However, use of the GETDEVICELIST operation introduces
a race condition versus notification about changes to pNFS device IDs
as provided by CB_NOTIFY_DEVICEID. Implementation experience with
block layout servers has shown there is no need for GETDEVICELIST.
Clients have to be able to request new devices using GETDEVICEINFO at
any time in response either to a new deviceid in LAYOUTGET results or
to the CB_NOTIFY_DEVICEID callback operation.
15.2.4. DESCRIPTION
Clients and servers MUST NOT implement the GETDEVICELIST operation.
16. NFSv4.2 Operations
16.1. Operation 59: ALLOCATE - Reserve Space in A Region of a File
16.1.1. ARGUMENT
struct ALLOCATE4args { struct ALLOCATE4args {
/* CURRENT_FH: file */ /* CURRENT_FH: file */
stateid4 aa_stateid; stateid4 aa_stateid;
offset4 aa_offset; offset4 aa_offset;
length4 aa_length; length4 aa_length;
}; };
15.1.2. RESULT 16.1.2. RESULT
struct ALLOCATE4res { struct ALLOCATE4res {
nfsstat4 ar_status; nfsstat4 ar_status;
}; };
15.1.3. DESCRIPTION 16.1.3. DESCRIPTION
Whenever a client wishes to reserve space for a region in a file it Whenever a client wishes to reserve space for a region in a file it
calls the ALLOCATE operation with the current filehandle set to the calls the ALLOCATE operation with the current filehandle set to the
filehandle of the file in question, and the start offset and length filehandle of the file in question, and the start offset and length
in bytes of the region set in aa_offset and aa_length respectively. in bytes of the region set in aa_offset and aa_length respectively.
The server will ensure that backing blocks are reserved to the region The server will ensure that backing blocks are reserved to the region
specified by aa_offset and aa_length, and that no future writes into specified by aa_offset and aa_length, and that no future writes into
this region will return NFS4ERR_NOSPC. If the region lies partially this region will return NFS4ERR_NOSPC. If the region lies partially
or fully outside the current file size the file size will be set to or fully outside the current file size the file size will be set to
skipping to change at page 60, line 40 skipping to change at page 64, line 40
It is not required that the server allocate the space to the file It is not required that the server allocate the space to the file
before returning success. The allocation can be deferred, however, before returning success. The allocation can be deferred, however,
it must be guaranteed that it will not fail for lack of space. The it must be guaranteed that it will not fail for lack of space. The
deferral does not result in an asynchronous reply. deferral does not result in an asynchronous reply.
The ALLOCATE operation will result in the space_used attribute and The ALLOCATE operation will result in the space_used attribute and
space_freed attributes being increased by the number of bytes space_freed attributes being increased by the number of bytes
reserved unless they were previously reserved or written and not reserved unless they were previously reserved or written and not
shared. shared.
15.2. Operation 60: COPY - Initiate a server-side copy 16.2. Operation 60: COPY - Initiate a server-side copy
15.2.1. ARGUMENT 16.2.1. ARGUMENT
struct COPY4args { struct COPY4args {
/* SAVED_FH: source file */ /* SAVED_FH: source file */
/* CURRENT_FH: destination file */ /* CURRENT_FH: destination file */
stateid4 ca_src_stateid; stateid4 ca_src_stateid;
stateid4 ca_dst_stateid; stateid4 ca_dst_stateid;
offset4 ca_src_offset; offset4 ca_src_offset;
offset4 ca_dst_offset; offset4 ca_dst_offset;
length4 ca_count; length4 ca_count;
bool ca_consecutive; bool ca_consecutive;
bool ca_synchronous; bool ca_synchronous;
netloc4 ca_source_server<>; netloc4 ca_source_server<>;
}; };
15.2.2. RESULT 16.2.2. RESULT
struct write_response4 { struct write_response4 {
stateid4 wr_callback_id<1>; stateid4 wr_callback_id<1>;
length4 wr_count; length4 wr_count;
stable_how4 wr_committed; stable_how4 wr_committed;
verifier4 wr_writeverf; verifier4 wr_writeverf;
}; };
struct COPY4res { struct COPY4res {
nfsstat4 cr_status; nfsstat4 cr_status;
write_response4 cr_response; write_response4 cr_response;
bool cr_consecutive; bool cr_consecutive;
bool cr_synchronous; bool cr_synchronous;
}; };
15.2.3. DESCRIPTION 16.2.3. DESCRIPTION
The COPY operation is used for both intra-server and inter-server The COPY operation is used for both intra-server and inter-server
copies. In both cases, the COPY is always sent from the client to copies. In both cases, the COPY is always sent from the client to
the destination server of the file copy. The COPY operation requests the destination server of the file copy. The COPY operation requests
that a file be copied from the location specified by the SAVED_FH that a file be copied from the location specified by the SAVED_FH
value to the location specified by the CURRENT_FH. value to the location specified by the CURRENT_FH.
The SAVED_FH must be a regular file. If SAVED_FH is not a regular The SAVED_FH must be a regular file. If SAVED_FH is not a regular
file, the operation MUST fail and return NFS4ERR_WRONG_TYPE. file, the operation MUST fail and return NFS4ERR_WRONG_TYPE.
skipping to change at page 65, line 8 skipping to change at page 69, line 8
asynchronously, the data copied from the source file to the asynchronously, the data copied from the source file to the
destination file MUST appear identical to the NFS client. However, destination file MUST appear identical to the NFS client. However,
the NFS server's on disk representation of the data in the source the NFS server's on disk representation of the data in the source
file and destination file MAY differ. For example, the NFS server file and destination file MAY differ. For example, the NFS server
might encrypt, compress, deduplicate, or otherwise represent the on might encrypt, compress, deduplicate, or otherwise represent the on
disk data in the source and destination file differently. disk data in the source and destination file differently.
If a failure does occur for a synchronous copy, wr_count will be set If a failure does occur for a synchronous copy, wr_count will be set
to the number of bytes copied to the destination file before the to the number of bytes copied to the destination file before the
error occurred. If cr_consecutive is true, then the bytes were error occurred. If cr_consecutive is true, then the bytes were
copied in order. If the failure occured for an asynchronous copy, copied in order. If the failure occurred for an asynchronous copy,
then the client will have gotten the notification of the consecutive then the client will have gotten the notification of the consecutive
copy order when it got the copy stateid. It will be able to copy order when it got the copy stateid. It will be able to
determine the bytes copied from the coa_bytes_copied in the determine the bytes copied from the coa_bytes_copied in the
CB_OFFLOAD argument. CB_OFFLOAD argument.
In either case, if cr_consecutive was not true, there is no assurance In either case, if cr_consecutive was not true, there is no assurance
as to exactly which bytes in the range were copied. The client MUST as to exactly which bytes in the range were copied. The client MUST
assume that there exists a mixture of the original contents of the assume that there exists a mixture of the original contents of the
range and the new bytes. If the COPY wrote past the end of the file range and the new bytes. If the COPY wrote past the end of the file
on the destination, then the last byte written to will determine the on the destination, then the last byte written to will determine the
new file size. The contents of any block not written to and past the new file size. The contents of any block not written to and past the
original size of the file will be as if a normal WRITE extended the original size of the file will be as if a normal WRITE extended the
file. file.
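
One possible client-side reading of the failure rules above is sketched below. It assumes the client simply wants to know which part of the range it still has to copy. The function name and the retry policy are illustrative and not mandated by this document. The bytes_copied argument stands for wr_count (synchronous copy) or coa_bytes_copied (asynchronous copy).

```python
def remaining_copy(src_offset: int, dst_offset: int, count: int,
                   bytes_copied: int, consecutive: bool):
    """Return (src_offset, dst_offset, count) still to be copied.

    If the copy was consecutive, the first bytes_copied bytes are known
    to be in place, so the copy can resume after them.  Otherwise there
    is no assurance about which bytes were copied, and the whole range
    has to be treated as a mixture of old and new data.
    """
    done = min(bytes_copied, count) if consecutive else 0
    return (src_offset + done, dst_offset + done, count - done)
```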
15.3. Operation 61: COPY_NOTIFY - Notify a source server of a future 16.3. Operation 61: COPY_NOTIFY - Notify a source server of a future
copy copy
15.3.1. ARGUMENT 16.3.1. ARGUMENT
struct COPY_NOTIFY4args { struct COPY_NOTIFY4args {
/* CURRENT_FH: source file */ /* CURRENT_FH: source file */
stateid4 cna_src_stateid; stateid4 cna_src_stateid;
netloc4 cna_destination_server; netloc4 cna_destination_server;
}; };
15.3.2. RESULT 16.3.2. RESULT
struct COPY_NOTIFY4resok { struct COPY_NOTIFY4resok {
nfstime4 cnr_lease_time; nfstime4 cnr_lease_time;
netloc4 cnr_source_server<>; netloc4 cnr_source_server<>;
}; };
union COPY_NOTIFY4res switch (nfsstat4 cnr_status) { union COPY_NOTIFY4res switch (nfsstat4 cnr_status) {
case NFS4_OK: case NFS4_OK:
COPY_NOTIFY4resok resok4; COPY_NOTIFY4resok resok4;
default: default:
void; void;
}; };
15.3.3. DESCRIPTION 16.3.3. DESCRIPTION
This operation is used for an inter-server copy. A client sends this This operation is used for an inter-server copy. A client sends this
operation in a COMPOUND request to the source server to authorize a operation in a COMPOUND request to the source server to authorize a
destination server identified by cna_destination_server to read the destination server identified by cna_destination_server to read the
file specified by CURRENT_FH on behalf of the given user. file specified by CURRENT_FH on behalf of the given user.
The cna_src_stateid MUST refer to either open or locking states The cna_src_stateid MUST refer to either open or locking states
provided earlier by the server. If it is invalid, then the operation provided earlier by the server. If it is invalid, then the operation
MUST fail. MUST fail.
skipping to change at page 66, line 47 skipping to change at page 70, line 47
A successful response will also contain a list of netloc4 network A successful response will also contain a list of netloc4 network
location formats called cnr_source_server, on which the source is location formats called cnr_source_server, on which the source is
willing to accept connections from the destination. These might not willing to accept connections from the destination. These might not
be reachable from the client and might be located on networks to be reachable from the client and might be located on networks to
which the client has no connection. which the client has no connection.
For a copy only involving one server (the source and destination are For a copy only involving one server (the source and destination are
on the same server), this operation is unnecessary. on the same server), this operation is unnecessary.
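
The ordering implied above can be sketched as a simple plan of which operation goes to which server. Identifying servers by a plain string and the function name plan_copy are assumptions made for the illustration.

```python
from typing import List, Tuple


def plan_copy(source_server: str, destination_server: str) -> List[Tuple[str, str]]:
    """Return the (operation, target server) pairs a client would send.

    COPY_NOTIFY goes to the source server and is only needed for an
    inter-server copy; COPY is always sent to the destination server.
    """
    steps = []
    if source_server != destination_server:
        steps.append(("COPY_NOTIFY", source_server))
    steps.append(("COPY", destination_server))
    return steps
```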
15.4. Modification to Operation 42: EXCHANGE_ID - Instantiate Client ID 16.4. Operation 62: DEALLOCATE - Unreserve Space in a Region of a File
15.4.1. ARGUMENT 16.4.1. ARGUMENT
/* new */
const EXCHGID4_FLAG_SUPP_FENCE_OPS = 0x00000004;
15.4.2. RESULT
Unchanged
15.4.3. MOTIVATION
Enterprise applications require guarantees that an operation has
either aborted or completed. NFSv4.1 provides this guarantee as long
as the session is alive: simply send a SEQUENCE operation on the same
slot with a new sequence number, and the successful return of
SEQUENCE indicates the previous operation has completed. However, if
the session is lost, there is no way to know when any in progress
operations have aborted or completed. In hindsight, the NFSv4.1
specification should have mandated that DESTROY_SESSION either abort
or complete all outstanding operations.
15.4.4. DESCRIPTION
A client SHOULD request the EXCHGID4_FLAG_SUPP_FENCE_OPS capability
when it sends an EXCHANGE_ID operation. The server SHOULD set this
capability in the EXCHANGE_ID reply whether the client requests it or
not. It is the server's return that determines whether this
capability is in effect. When it is in effect, the following will
occur:
o The server will not reply to any DESTROY_SESSION invoked with the
client ID until all operations in progress are completed or
aborted.
o The server will not reply to subsequent EXCHANGE_ID invoked on the
same client owner with a new verifier until all operations in
progress on the client ID's session are completed or aborted.
o The NFS server SHOULD support client ID trunking, and if it does
and the EXCHGID4_FLAG_SUPP_FENCE_OPS capability is enabled, then a
session ID created on one node of the storage cluster MUST be
destroyable via DESTROY_SESSION. In addition, DESTROY_CLIENTID
and an EXCHANGE_ID with a new verifier affect all sessions
regardless of which node the sessions were created on.
15.5. Operation 62: DEALLOCATE - Unreserve Space in a Region of a File
15.5.1. ARGUMENT
struct DEALLOCATE4args { struct DEALLOCATE4args {
/* CURRENT_FH: file */ /* CURRENT_FH: file */
stateid4 da_stateid; stateid4 da_stateid;
offset4 da_offset; offset4 da_offset;
length4 da_length; length4 da_length;
}; };
15.5.2. RESULT 16.4.2. RESULT
struct DEALLOCATE4res { struct DEALLOCATE4res {
nfsstat4 dr_status; nfsstat4 dr_status;
}; };
15.5.3. DESCRIPTION 16.4.3. DESCRIPTION
Whenever a client wishes to unreserve space for a region in a file it Whenever a client wishes to unreserve space for a region in a file it
calls the DEALLOCATE operation with the current filehandle set to the calls the DEALLOCATE operation with the current filehandle set to the
filehandle of the file in question, and the start offset and length filehandle of the file in question, and the start offset and length
in bytes of the region set in aa_offset and aa_length respectively. in bytes of the region set in da_offset and da_length respectively.
If no space was allocated or reserved for all or parts of the region, If no space was allocated or reserved for all or parts of the region,
the DEALLOCATE operation will have no effect for the region that the DEALLOCATE operation will have no effect for the region that
already is in unreserved state. All further reads from the region already is in unreserved state. All further reads from the region
passed to DEALLOCATE MUST return zeros until overwritten. The passed to DEALLOCATE MUST return zeros until overwritten. The
filehandle specified must be that of a regular file. filehandle specified must be that of a regular file.
Situations may arise where da_offset and/or da_offset + da_length Situations may arise where da_offset and/or da_offset + da_length
will not be aligned to a boundary for which the server does will not be aligned to a boundary for which the server does
allocations or deallocations. For most file systems, this is the allocations or deallocations. For most file systems, this is the
block size of the file system. In such a case, the server can block size of the file system. In such a case, the server can
deallocate as many bytes as it can in the region. The blocks that deallocate as many bytes as it can in the region. The blocks that
cannot be deallocated MUST be zeroed. cannot be deallocated MUST be zeroed.
DEALLOCATE will result in the space_used attribute being decreased by DEALLOCATE will result in the space_used attribute being decreased by
the number of bytes that were deallocated. The space_freed attribute the number of bytes that were deallocated. The space_freed attribute
may or may not decrease, depending on the support and whether the may or may not decrease, depending on the support and whether the
blocks backing the specified range were shared or not. The size blocks backing the specified range were shared or not. The size
attribute will remain unchanged. attribute will remain unchanged.
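
The alignment behavior described above can be illustrated with a sketch that splits a DEALLOCATE request into the part a server could release and the parts it would have to zero. It assumes the server works in whole blocks of a fixed block_size. The function name and the half-open (start, end) range representation are hypothetical.

```python
def split_deallocate(da_offset: int, da_length: int, block_size: int):
    """Split [da_offset, da_offset + da_length) on block boundaries.

    Returns (dealloc, zero): dealloc is the (start, end) range of whole
    blocks that can be released, or None if no whole block is covered;
    zero is the list of leftover (start, end) ranges to be zeroed.
    """
    end = da_offset + da_length
    first_full = -(-da_offset // block_size) * block_size  # round up
    last_full = (end // block_size) * block_size           # round down
    if first_full >= last_full:
        # No whole block lies inside the region: everything is zeroed.
        zero = [(da_offset, end)] if da_length > 0 else []
        return None, zero
    zero = []
    if da_offset < first_full:
        zero.append((da_offset, first_full))
    if last_full < end:
        zero.append((last_full, end))
    return (first_full, last_full), zero
```

Either way, a later read of any byte in the requested region returns zeros until it is overwritten.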
15.6. Operation 63: IO_ADVISE - Application I/O access pattern hints 16.5. Operation 63: IO_ADVISE - Application I/O access pattern hints
15.6.1. ARGUMENT 16.5.1. ARGUMENT
enum IO_ADVISE_type4 { enum IO_ADVISE_type4 {
IO_ADVISE4_NORMAL = 0, IO_ADVISE4_NORMAL = 0,
IO_ADVISE4_SEQUENTIAL = 1, IO_ADVISE4_SEQUENTIAL = 1,
IO_ADVISE4_SEQUENTIAL_BACKWARDS = 2, IO_ADVISE4_SEQUENTIAL_BACKWARDS = 2,
IO_ADVISE4_RANDOM = 3, IO_ADVISE4_RANDOM = 3,
IO_ADVISE4_WILLNEED = 4, IO_ADVISE4_WILLNEED = 4,
IO_ADVISE4_WILLNEED_OPPORTUNISTIC = 5, IO_ADVISE4_WILLNEED_OPPORTUNISTIC = 5,
IO_ADVISE4_DONTNEED = 6, IO_ADVISE4_DONTNEED = 6,
IO_ADVISE4_NOREUSE = 7, IO_ADVISE4_NOREUSE = 7,
skipping to change at page 69, line 31 skipping to change at page 72, line 28
}; };
struct IO_ADVISE4args { struct IO_ADVISE4args {
/* CURRENT_FH: file */ /* CURRENT_FH: file */
stateid4 iaa_stateid; stateid4 iaa_stateid;
offset4 iaa_offset; offset4 iaa_offset;
length4 iaa_count; length4 iaa_count;
bitmap4 iaa_hints; bitmap4 iaa_hints;
}; };
15.6.2. RESULT 16.5.2. RESULT
struct IO_ADVISE4resok { struct IO_ADVISE4resok {
bitmap4 ior_hints; bitmap4 ior_hints;
}; };
union IO_ADVISE4res switch (nfsstat4 ior_status) { union IO_ADVISE4res switch (nfsstat4 ior_status) {
case NFS4_OK: case NFS4_OK:
IO_ADVISE4resok resok4; IO_ADVISE4resok resok4;
default: default:
void; void;
}; };
15.6.3. DESCRIPTION 16.5.3. DESCRIPTION
The IO_ADVISE operation sends an I/O access pattern hint to the The IO_ADVISE operation sends an I/O access pattern hint to the
server for the owner of the stateid for a given byte range specified server for the owner of the stateid for a given byte range specified
by iaa_offset and iaa_count. The byte range specified by iaa_offset by iaa_offset and iaa_count. The byte range specified by iaa_offset
and iaa_count need not currently exist in the file, but the iaa_hints and iaa_count need not currently exist in the file, but the iaa_hints
will apply to the byte range when it does exist. If iaa_count is 0, will apply to the byte range when it does exist. If iaa_count is 0,
all data following iaa_offset is specified. The server MAY ignore all data following iaa_offset is specified. The server MAY ignore
the advice. the advice.
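
A sketch of how a client might build the iaa_hints argument is given below. It assumes the usual NFSv4 bitmap4 convention (an array of 32-bit words in which bit n lives in word n / 32 at position n mod 32). Only a few of the enum values listed in the ARGUMENT section are repeated here, and the helper names are illustrative.

```python
from typing import Iterable, List, Optional

# A subset of IO_ADVISE_type4 values from the ARGUMENT section.
IO_ADVISE4_NORMAL = 0
IO_ADVISE4_SEQUENTIAL = 1
IO_ADVISE4_RANDOM = 3
IO_ADVISE4_WILLNEED = 4
IO_ADVISE4_DONTNEED = 6


def hints_to_bitmap4(hints: Iterable[int]) -> List[int]:
    """Encode hint numbers as a list of 32-bit bitmap words."""
    words: List[int] = []
    for hint in hints:
        index, bit = divmod(hint, 32)
        while len(words) <= index:
            words.append(0)
        words[index] |= 1 << bit
    return words


def advised_end(iaa_offset: int, iaa_count: int) -> Optional[int]:
    """End of the advised byte range; None means 'all data following
    iaa_offset', which is what an iaa_count of 0 specifies."""
    return None if iaa_count == 0 else iaa_offset + iaa_count
```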
The following are the allowed hints for a stateid holder: The following are the allowed hints for a stateid holder:
skipping to change at page 71, line 32 skipping to change at page 74, line 30
perhaps due to a temporary resource limitation. perhaps due to a temporary resource limitation.
Each issuance of the IO_ADVISE operation overrides all previous Each issuance of the IO_ADVISE operation overrides all previous
issuances of IO_ADVISE for a given byte range. This effectively issuances of IO_ADVISE for a given byte range. This effectively
follows a strategy of last hint wins for a given stateid and byte follows a strategy of last hint wins for a given stateid and byte
range. range.
Clients should assume that hints included in an IO_ADVISE operation Clients should assume that hints included in an IO_ADVISE operation
will be forgotten once the file is closed. will be forgotten once the file is closed.
15.6.4. IMPLEMENTATION 16.5.4. IMPLEMENTATION
The NFS client may choose to issue an IO_ADVISE operation to the The NFS client may choose to issue an IO_ADVISE operation to the
server in several different instances. server in several different instances.
The most obvious is in direct response to an application's execution The most obvious is in direct response to an application's execution
of posix_fadvise(). In this case, IO_ADVISE4_WRITE and of posix_fadvise(). In this case, IO_ADVISE4_WRITE and
IO_ADVISE4_READ may be set based upon the type of file access IO_ADVISE4_READ may be set based upon the type of file access
specified when the file was opened. specified when the file was opened.
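
The posix_fadvise() case might look like the sketch below. The table mapping POSIX advice names to IO_ADVISE hint names is an assumption made for illustration (this section does not spell it out), and hints are kept as names rather than numbers so that no enum values are implied.

```python
from typing import List

# Assumed mapping from posix_fadvise() advice to IO_ADVISE hint names.
FADVISE_TO_HINT = {
    "POSIX_FADV_NORMAL": "IO_ADVISE4_NORMAL",
    "POSIX_FADV_SEQUENTIAL": "IO_ADVISE4_SEQUENTIAL",
    "POSIX_FADV_RANDOM": "IO_ADVISE4_RANDOM",
    "POSIX_FADV_WILLNEED": "IO_ADVISE4_WILLNEED",
    "POSIX_FADV_DONTNEED": "IO_ADVISE4_DONTNEED",
    "POSIX_FADV_NOREUSE": "IO_ADVISE4_NOREUSE",
}


def hints_for_fadvise(advice: str, opened_for_read: bool,
                      opened_for_write: bool) -> List[str]:
    """Hint names a client could send for one posix_fadvise() call.

    IO_ADVISE4_READ and IO_ADVISE4_WRITE are added according to the
    access mode the file was opened with.
    """
    hints = [FADVISE_TO_HINT[advice]]
    if opened_for_read:
        hints.append("IO_ADVISE4_READ")
    if opened_for_write:
        hints.append("IO_ADVISE4_WRITE")
    return hints
```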
15.6.5. IO_ADVISE4_INIT_PROXIMITY 16.5.5. IO_ADVISE4_INIT_PROXIMITY
The IO_ADVISE4_INIT_PROXIMITY hint is non-posix in origin and can be The IO_ADVISE4_INIT_PROXIMITY hint is non-posix in origin and can be
used to convey that the client has recently accessed the byte range used to convey that the client has recently accessed the byte range
in its own cache. I.e., it has not accessed it on the server, but it in its own cache. I.e., it has not accessed it on the server, but it
has locally. When the server reaches resource exhaustion, knowing has locally. When the server reaches resource exhaustion, knowing
which data is more important allows the server to make better choices which data is more important allows the server to make better choices
about which data to, for example, purge from a cache or move to about which data to, for example, purge from a cache or move to
secondary storage. It also informs the server which delegations are secondary storage. It also informs the server which delegations are
more important, since if delegations are working correctly, once more important, since if delegations are working correctly, once
delegated to a client and the client has read the content for that delegated to a client and the client has read the content for that
skipping to change at page 72, line 18 skipping to change at page 75, line 15
The IO_ADVISE4_INIT_PROXIMITY hint can also be used in a pNFS setting The IO_ADVISE4_INIT_PROXIMITY hint can also be used in a pNFS setting
to let the client inform the metadata server as to the I/O statistics to let the client inform the metadata server as to the I/O statistics
between the client and the storage devices. The metadata server is between the client and the storage devices. The metadata server is
then free to use this information about client I/O to optimize the then free to use this information about client I/O to optimize the
data storage location. data storage location.
This hint is also useful in the case of NFS clients which are network This hint is also useful in the case of NFS clients which are network
booting from a server. If the first client to be booted sends this booting from a server. If the first client to be booted sends this
hint, then it keeps the cache warm for the remaining clients. hint, then it keeps the cache warm for the remaining clients.
15.6.6. pNFS File Layout Data Type Considerations 16.5.6. pNFS File Layout Data Type Considerations
The IO_ADVISE considerations for pNFS are very similar to the COMMIT The IO_ADVISE considerations for pNFS are very similar to the COMMIT
considerations for pNFS. That is, as with COMMIT, some NFS server considerations for pNFS. That is, as with COMMIT, some NFS server
implementations prefer IO_ADVISE be done on the DS, and some prefer implementations prefer IO_ADVISE be done on the DS, and some prefer
it be done on the MDS. it be done on the MDS.
For the file's layout type, it is proposed that NFSv4.2 include an For the file's layout type, it is proposed that NFSv4.2 include an
additional hint NFL42_CARE_IO_ADVISE_THRU_MDS which is valid only on additional hint NFL42_CARE_IO_ADVISE_THRU_MDS which is valid only on
metadata servers running NFSv4.2 or higher. Any file's layout metadata servers running NFSv4.2 or higher. Any file's layout
obtained from a NFSv4.1 metadata server MUST NOT have obtained from a NFSv4.1 metadata server MUST NOT have
skipping to change at page 73, line 5 skipping to change at page 75, line 47
send an IO_ADVISE operation to the appropriate DS for the specified send an IO_ADVISE operation to the appropriate DS for the specified
byte range. While the client MAY always send IO_ADVISE to the MDS, byte range. While the client MAY always send IO_ADVISE to the MDS,
if the server has not set NFL42_UFLG_IO_ADVISE_THRU_MDS, the client if the server has not set NFL42_UFLG_IO_ADVISE_THRU_MDS, the client
should expect that such an IO_ADVISE is futile. Note that a client should expect that such an IO_ADVISE is futile. Note that a client
SHOULD use the same set of arguments on each IO_ADVISE sent to a DS SHOULD use the same set of arguments on each IO_ADVISE sent to a DS
for the same open file reference. for the same open file reference.
The server is not required to support different advice for different The server is not required to support different advice for different
DS's with the same open file reference. DS's with the same open file reference.
15.6.6.1. Dense and Sparse Packing Considerations 16.5.6.1. Dense and Sparse Packing Considerations
The IO_ADVISE operation MUST use the iaa_offset and byte range as The IO_ADVISE operation MUST use the iaa_offset and byte range as
dictated by the presence or absence of NFL4_UFLG_DENSE. dictated by the presence or absence of NFL4_UFLG_DENSE.
E.g., if NFL4_UFLG_DENSE is present, and a READ or WRITE to the DS E.g., if NFL4_UFLG_DENSE is present, and a READ or WRITE to the DS
for iaa_offset 0 really means iaa_offset 10000 in the logical file, for iaa_offset 0 really means iaa_offset 10000 in the logical file,
then an IO_ADVISE for iaa_offset 0 means iaa_offset 10000. then an IO_ADVISE for iaa_offset 0 means iaa_offset 10000.
E.g., if NFL4_UFLG_DENSE is absent, then a READ or WRITE to the DS E.g., if NFL4_UFLG_DENSE is absent, then a READ or WRITE to the DS
for iaa_offset 0 really means iaa_offset 0 in the logical file, then for iaa_offset 0 really means iaa_offset 0 in the logical file, then
skipping to change at page 74, line 25 skipping to change at page 77, line 25
If neither of the flags NFL42_UFLG_IO_ADVISE_THRU_MDS and If neither of the flags NFL42_UFLG_IO_ADVISE_THRU_MDS and
NFL4_UFLG_DENSE is set in the layout, then any IO_ADVISE request NFL4_UFLG_DENSE is set in the layout, then any IO_ADVISE request
sent to the data server with a byte range that overlaps a stripe unit sent to the data server with a byte range that overlaps a stripe unit
that the data server does not serve MUST NOT result in the status that the data server does not serve MUST NOT result in the status
NFS4ERR_PNFS_IO_HOLE. Instead, the response SHOULD be successful and NFS4ERR_PNFS_IO_HOLE. Instead, the response SHOULD be successful and
if the server applies IO_ADVISE hints on any stripe units that if the server applies IO_ADVISE hints on any stripe units that
overlap with the specified range, those hints SHOULD be indicated in overlap with the specified range, those hints SHOULD be indicated in
the response. the response.
15.7. Operation 64: LAYOUTERROR - Provide Errors for the Layout 16.6. Operation 64: LAYOUTERROR - Provide Errors for the Layout
15.7.1. ARGUMENT
struct layoutupdate4 { 16.6.1. ARGUMENT
layouttype4 lou_type;
opaque lou_body<>;
};
struct device_error4 { struct device_error4 {
deviceid4 de_deviceid; deviceid4 de_deviceid;
nfsstat4 de_status; nfsstat4 de_status;
nfs_opnum4 de_opnum; nfs_opnum4 de_opnum;
}; };
struct LAYOUTERROR4args { struct LAYOUTERROR4args {
/* CURRENT_FH: file */ /* CURRENT_FH: file */
offset4 lea_offset; offset4 lea_offset;
length4 lea_length; length4 lea_length;
stateid4 lea_stateid; stateid4 lea_stateid;
device_error4 lea_errors; device_error4 lea_errors;
}; };
15.7.2. RESULT 16.6.2. RESULT
struct LAYOUTERROR4res { struct LAYOUTERROR4res {
nfsstat4 ler_status; nfsstat4 ler_status;
}; };
15.7.3. DESCRIPTION 16.6.3. DESCRIPTION
The client can use LAYOUTERROR to inform the metadata server about The client can use LAYOUTERROR to inform the metadata server about
errors in its interaction with the layout represented by the current errors in its interaction with the layout represented by the current
filehandle, client ID (derived from the session ID in the preceding filehandle, client ID (derived from the session ID in the preceding
SEQUENCE operation), byte-range (lea_offset + lea_length), and SEQUENCE operation), byte-range (lea_offset + lea_length), and
lea_stateid. lea_stateid.
Each individual device_error4 describes a single error associated Each individual device_error4 describes a single error associated
with a storage device, which is identified via de_deviceid. If the with a storage device, which is identified via de_deviceid. If the
Layout Type supports NFSv4 operations, then the operation which Layout Type supports NFSv4 operations, then the operation which
returned the error is identified via de_opnum. If the Layout Type returned the error is identified via de_opnum. If the Layout Type
does not support NFSv4 operations, then it MAY choose to either map does not support NFSv4 operations, then it MAY choose to either map
the operation onto one of the allowed operations which can be sent to the operation onto one of the allowed operations which can be sent to
a storage device with the File Layout Type (see Section 3.3) or it a storage device with the File Layout Type (see Section 3.3) or it
can signal no support for operations by marking de_opnum with the can signal no support for operations by marking de_opnum with the
ILEGAL operation. Finally the NFS error value (nfsstat4) encountered ILLEGAL operation. Finally the NFS error value (nfsstat4)
is provided via de_status and may consist of the following error encountered is provided via de_status and may consist of the
codes: following error codes:
NFS4ERR_NXIO: The client was unable to establish any communication NFS4ERR_NXIO: The client was unable to establish any communication
with the storage device. with the storage device.
NFS4ERR_*: The client was able to establish communication with the NFS4ERR_*: The client was able to establish communication with the
storage device and is returning one of the allowed error codes for storage device and is returning one of the allowed error codes for
the operation denoted by de_opnum. the operation denoted by de_opnum.
Note that while the metadata server may return an error associated Note that while the metadata server may return an error associated
with the layout stateid or the open file, it MUST NOT return an error with the layout stateid or the open file, it MUST NOT return an error
in the processing of the errors. If LAYOUTERROR is in a compound in the processing of the errors. If LAYOUTERROR is in a compound
before LAYOUTRETURN, it MUST NOT introduce an error other than what before LAYOUTRETURN, it MUST NOT introduce an error other than what
LAYOUTRETURN would already encounter. LAYOUTRETURN would already encounter.
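The rules above for filling in a device_error4 can be sketched as
follows. This is a minimal illustration, not part of the protocol
definition: the helper make_device_error() and its parameters are
hypothetical, and the numeric constants are the values assigned in
[RFC5661].

```python
from dataclasses import dataclass
from typing import Optional

NFS4ERR_NXIO = 6    # nfsstat4 value from RFC 5661
OP_READ = 25        # nfs_opnum4 value from RFC 5661
OP_ILLEGAL = 10044  # nfs_opnum4 value from RFC 5661


@dataclass
class DeviceError4:
    """Mirrors device_error4 from the ARGUMENT section."""
    de_deviceid: bytes
    de_status: int
    de_opnum: int


def make_device_error(deviceid: bytes, connected: bool,
                      status: Optional[int] = None,
                      opnum: Optional[int] = None) -> DeviceError4:
    """Hypothetical helper: fill in one device_error4 for LAYOUTERROR."""
    if not connected:
        # No communication with the storage device could be established.
        de_status = NFS4ERR_NXIO
    elif status is None:
        raise ValueError("a status is required once the device was reached")
    else:
        # One of the error codes allowed for the operation in de_opnum.
        de_status = status
    # A Layout Type without NFSv4 operations may mark de_opnum as ILLEGAL.
    de_opnum = OP_ILLEGAL if opnum is None else opnum
    return DeviceError4(deviceid, de_status, de_opnum)
```

Passing opnum=None models a Layout Type that signals no support for
NFSv4 operations; a File Layout client would pass the operation that
actually failed, e.g., OP_READ.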
15.7.4. IMPLEMENTATION 16.6.4. IMPLEMENTATION
There are two broad classes of errors, transient and persistent. The There are two broad classes of errors, transient and persistent. The
client SHOULD strive to only use this new mechanism to report client SHOULD strive to only use this new mechanism to report
persistent errors. It MUST be able to deal with transient issues by persistent errors. It MUST be able to deal with transient issues by
itself. Also, while the client might consider an issue to be itself. Also, while the client might consider an issue to be
persistent, it MUST be prepared for the metadata server to consider persistent, it MUST be prepared for the metadata server to consider
such issues to be transient. A prime example of this is if the such issues to be transient. A prime example of this is if the
metadata server fences off a client from either a stateid or a metadata server fences off a client from either a stateid or a
filehandle. The client will get an error from the storage device and filehandle. The client will get an error from the storage device and
might relay either NFS4ERR_ACCESS or NFS4ERR_BAD_STATEID back to the might relay either NFS4ERR_ACCESS or NFS4ERR_BAD_STATEID back to the
skipping to change at page 77, line 23 skipping to change at page 80, line 15
metadata server might not have any choice in using the storage metadata server might not have any choice in using the storage
device, i.e., there might only be one possible layout for the system. device, i.e., there might only be one possible layout for the system.
Also, in the case of existing files, the metadata server might have Also, in the case of existing files, the metadata server might have
no choice in which storage devices to hand out to clients. no choice in which storage devices to hand out to clients.
The metadata server is not required to indefinitely retain per-client The metadata server is not required to indefinitely retain per-client
storage device error information. A metadata server is also not storage device error information. A metadata server is also not
required to automatically reinstate use of a previously problematic required to automatically reinstate use of a previously problematic
storage device; administrative intervention may be required instead. storage device; administrative intervention may be required instead.
15.8. Operation 65: LAYOUTSTATS - Provide Statistics for the Layout 16.7. Operation 65: LAYOUTSTATS - Provide Statistics for the Layout
15.8.1. ARGUMENT 16.7.1. ARGUMENT
struct layoutupdate4 { struct layoutupdate4 {
layouttype4 lou_type; layouttype4 lou_type;
opaque lou_body<>; opaque lou_body<>;
}; };
struct io_info4 { struct io_info4 {
uint32_t ii_count; uint32_t ii_count;
uint64_t ii_bytes; uint64_t ii_bytes;
}; };
skipping to change at page 77, line 47 skipping to change at page 80, line 39
struct LAYOUTSTATS4args { struct LAYOUTSTATS4args {
/* CURRENT_FH: file */ /* CURRENT_FH: file */
offset4 lsa_offset; offset4 lsa_offset;
length4 lsa_length; length4 lsa_length;
stateid4 lsa_stateid; stateid4 lsa_stateid;
io_info4 lsa_read; io_info4 lsa_read;
io_info4 lsa_write; io_info4 lsa_write;
layoutupdate4 lsa_layoutupdate; layoutupdate4 lsa_layoutupdate;
}; };
15.8.2. RESULT 16.7.2. RESULT
struct LAYOUTSTATS4res { struct LAYOUTSTATS4res {
nfsstat4 lsr_status; nfsstat4 lsr_status;
}; };
15.8.3. DESCRIPTION 16.7.3. DESCRIPTION
The client can use LAYOUTSTATS to inform the metadata server about The client can use LAYOUTSTATS to inform the metadata server about
its interaction with the layout represented by the current its interaction with the layout represented by the current
filehandle, client ID (derived from the session ID in the preceding filehandle, client ID (derived from the session ID in the preceding
SEQUENCE operation), byte-range (lsa_offset + lsa_length), and SEQUENCE operation), byte-range (lsa_offset + lsa_length), and
lsa_stateid. lsa_read and lsa_write allow for non-Layout Type lsa_stateid. lsa_read and lsa_write allow for non-Layout Type
specific statistices to be reported. The remaining information the specific statistics to be reported. The remaining information the
client is presenting is specific to the Layout Type and presented in client is presenting is specific to the Layout Type and presented in
the lsa_layoutupdate field. Each Layout Type MUST define the the lsa_layoutupdate field. Each Layout Type MUST define the
contents of lsa_layoutupdate in their respective specifications. contents of lsa_layoutupdate in their respective specifications.
LAYOUTSTATS can be combined with IO_ADVISE (see Section 15.6) to LAYOUTSTATS can be combined with IO_ADVISE (see Section 16.5) to
augment the decision making process of how the metadata server augment the decision making process of how the metadata server
handles a file. I.e., IO_ADVISE lets the server know that a byte handles a file. I.e., IO_ADVISE lets the server know that a byte
range has a certain characteristic, but not necessarily the intensity range has a certain characteristic, but not necessarily the intensity
of that characteristic. of that characteristic.
The client MUST reset the statistics after getting a successfully
reply from the metadata server. The first LAYOUTSTATS sent by the
client SHOULD be from the opening of the file. The choice of how
often to update the metadata server is made by the client.
Note that while the metadata server may return an error associated Note that while the metadata server may return an error associated
with the layout stateid or the open file, it MUST NOT return an error with the layout stateid or the open file, it MUST NOT return an error
in the processing of the statistics. in the processing of the statistics.
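As an illustration of the non-Layout Type specific counters, a client
might accumulate io_info4 values as sketched below. The class and
method names are hypothetical, lsa_stateid and lsa_layoutupdate are
omitted, and saturating at the XDR field width is a choice made by
this sketch rather than something the text requires.

```python
from dataclasses import dataclass, field

UINT32_MAX = 2**32 - 1
UINT64_MAX = 2**64 - 1


@dataclass
class IoInfo4:
    """Mirrors io_info4: an operation count and a byte count."""
    ii_count: int = 0  # uint32_t on the wire
    ii_bytes: int = 0  # uint64_t on the wire

    def record(self, nbytes: int) -> None:
        # Saturate instead of wrapping (an assumption of this sketch).
        self.ii_count = min(self.ii_count + 1, UINT32_MAX)
        self.ii_bytes = min(self.ii_bytes + nbytes, UINT64_MAX)


@dataclass
class LayoutStats4Args:
    """Subset of LAYOUTSTATS4args carrying the generic counters."""
    lsa_offset: int
    lsa_length: int
    lsa_read: IoInfo4 = field(default_factory=IoInfo4)
    lsa_write: IoInfo4 = field(default_factory=IoInfo4)
```

A client would call lsa_read.record(n) or lsa_write.record(n) after
each completed I/O to a storage device and send the accumulated
values in its next LAYOUTSTATS.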
15.9. Operation 66: OFFLOAD_CANCEL - Stop an Offloaded Operation 16.8. Operation 66: OFFLOAD_CANCEL - Stop an Offloaded Operation
15.9.1. ARGUMENT 16.8.1. ARGUMENT
struct OFFLOAD_CANCEL4args { struct OFFLOAD_CANCEL4args {
/* CURRENT_FH: source file */ /* CURRENT_FH: file to cancel */
stateid4 oca_stateid; stateid4 oca_stateid;
}; };
15.9.2. RESULT 16.8.2. RESULT
struct OFFLOAD_CANCEL4res { struct OFFLOAD_CANCEL4res {
nfsstat4 ocr_status; nfsstat4 ocr_status;
}; };
15.9.3. DESCRIPTION 16.8.3. DESCRIPTION
OFFLOAD_CANCEL is used by the client to terminate an asynchronous OFFLOAD_CANCEL is used by the client to terminate an asynchronous
operation, which is identifed both by CURRENT_FH and the oca_stateid. operation, which is identified both by CURRENT_FH and the
I.e., there can be multiple offloaded operations acting on the file, oca_stateid. I.e., there can be multiple offloaded operations acting
the stateid will identify to the server exactly which one is to be on the file, the stateid will identify to the server exactly which
stopped. one is to be stopped. Currently there are only two operations which
can decide to be asynchronous: COPY and WRITE_SAME.
In the context of server-to-server copy, the client can send In the context of server-to-server copy, the client can send
OFFLOAD_CANCEL to either the source or destination server, albeit OFFLOAD_CANCEL to either the source or destination server, albeit
with a different stateid. The client uses OFFLOAD_CANCEL to inform with a different stateid. The client uses OFFLOAD_CANCEL to inform
the destination to stop the active transfer and uses the stateid it the destination to stop the active transfer and uses the stateid it
got back from the COPY operation. The client uses OFFLOAD_CANCEL and got back from the COPY operation. The client uses OFFLOAD_CANCEL and
the stateid it used in the COPY_NOTIFY to inform the source to not the stateid it used in the COPY_NOTIFY to inform the source to not
allow any more copying from the destination. allow any more copying from the destination.
OFFLOAD_CANCEL is also useful in situations in which the source OFFLOAD_CANCEL is also useful in situations in which the source
server granted a very long or infinite lease on the destination server granted a very long or infinite lease on the destination
server's ability to read the source file and all copy operations on server's ability to read the source file and all copy operations on
the source file have been completed. the source file have been completed.
15.10. Operation 67: OFFLOAD_STATUS - Poll for Status of Asynchronous 16.9. Operation 67: OFFLOAD_STATUS - Poll for Status of Asynchronous
Operation Operation
15.10.1. ARGUMENT 16.9.1. ARGUMENT
struct OFFLOAD_STATUS4args { struct OFFLOAD_STATUS4args {
/* CURRENT_FH: destination file */ /* CURRENT_FH: destination file */
stateid4 osa_stateid; stateid4 osa_stateid;
}; };
15.10.2. RESULT 16.9.2. RESULT
struct OFFLOAD_STATUS4resok { struct OFFLOAD_STATUS4resok {
length4 osr_count; length4 osr_count;
nfsstat4 osr_complete<1>; nfsstat4 osr_complete<1>;
}; };
union OFFLOAD_STATUS4res switch (nfsstat4 osr_status) { union OFFLOAD_STATUS4res switch (nfsstat4 osr_status) {
case NFS4_OK: case NFS4_OK:
OFFLOAD_STATUS4resok osr_resok4; OFFLOAD_STATUS4resok osr_resok4;
default: default:
void; void;
}; };
15.10.3. DESCRIPTION 16.9.3. DESCRIPTION
OFFLOAD_STATUS can be used by the client to query the progress of an OFFLOAD_STATUS can be used by the client to query the progress of an
asynchronous operation, which is identifed both by CURRENT_FH and the asynchronous operation, which is identified both by CURRENT_FH and
osa_stateid. If this operation is successful, the number of bytes the osa_stateid. If this operation is successful, the number of
processed are returned to the client in the osr_count field. bytes processed are returned to the client in the osr_count field.
If the optional osr_complete field is present, the asynchronous If the optional osr_complete field is present, the asynchronous
operation has completed. In this case the status value indicates the operation has completed. In this case the status value indicates the
result of the asynchronous operation. In all cases, the server will result of the asynchronous operation. In all cases, the server will
also deliver the final results of the asynchronous operation in a also deliver the final results of the asynchronous operation in a
CB_OFFLOAD operation. CB_OFFLOAD operation.
The failure of this operation does not indicate the result of the The failure of this operation does not indicate the result of the
asynchronous operation in any way. asynchronous operation in any way.
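The three outcomes a client has to distinguish can be sketched as
follows. The function describe() and its return strings are
hypothetical; the optional osr_complete<1> array is modeled here as a
value that is either None (absent) or an nfsstat4.

```python
from dataclasses import dataclass
from typing import Optional

NFS4_OK = 0     # nfsstat4 value from RFC 5661
NFS4ERR_IO = 5  # nfsstat4 value from RFC 5661


@dataclass
class OffloadStatusResok:
    """Mirrors OFFLOAD_STATUS4resok."""
    osr_count: int                      # bytes processed so far
    osr_complete: Optional[int] = None  # models nfsstat4 osr_complete<1>


def describe(osr_status: int, resok: Optional[OffloadStatusResok]) -> str:
    """Hypothetical helper: classify one OFFLOAD_STATUS reply."""
    if osr_status != NFS4_OK or resok is None:
        # A failed query says nothing about the asynchronous operation.
        return "unknown"
    if resok.osr_complete is None:
        return "in progress"
    # osr_complete carries the result of the asynchronous operation.
    return "completed" if resok.osr_complete == NFS4_OK else "failed"
```

In every case the client should still expect the final results to
arrive in a CB_OFFLOAD operation.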
15.11. Operation 68: READ_PLUS - READ Data or Holes from a File 16.10. Operation 68: READ_PLUS - READ Data or Holes from a File
15.11.1. ARGUMENT 16.10.1. ARGUMENT
struct READ_PLUS4args { struct READ_PLUS4args {
/* CURRENT_FH: file */ /* CURRENT_FH: file */
stateid4 rpa_stateid; stateid4 rpa_stateid;
offset4 rpa_offset; offset4 rpa_offset;
count4 rpa_count; count4 rpa_count;
}; };
15.11.2. RESULT 16.10.2. RESULT
enum data_content4 {
NFS4_CONTENT_DATA = 0,
NFS4_CONTENT_HOLE = 1
};
struct data_info4 { struct data_info4 {
offset4 di_offset; offset4 di_offset;
length4 di_length; length4 di_length;
}; };
struct data4 { struct data4 {
offset4 d_offset; offset4 d_offset;
opaque d_data<>; opaque d_data<>;
}; };
skipping to change at page 81, line 19 skipping to change at page 84, line 28
read_plus_content rpr_contents<>; read_plus_content rpr_contents<>;
}; };
union READ_PLUS4res switch (nfsstat4 rp_status) { union READ_PLUS4res switch (nfsstat4 rp_status) {
case NFS4_OK: case NFS4_OK:
read_plus_res4 rp_resok4; read_plus_res4 rp_resok4;
default: default:
void; void;
}; };
15.11.3. DESCRIPTION 16.10.3. DESCRIPTION
The READ_PLUS operation is based upon the NFSv4.1 READ operation (see The READ_PLUS operation is based upon the NFSv4.1 READ operation (see
Section 18.22 of [RFC5661]) and similarly reads data from the regular Section 18.22 of [RFC5661]) and similarly reads data from the regular
file identified by the current filehandle. file identified by the current filehandle.
The client provides a rpa_offset of where the READ_PLUS is to start The client provides a rpa_offset of where the READ_PLUS is to start
and a rpa_count of how many bytes are to be read. A rpa_offset of and a rpa_count of how many bytes are to be read. A rpa_offset of
zero means to read data starting at the beginning of the file. If zero means to read data starting at the beginning of the file. If
rpa_offset is greater than or equal to the size of the file, the rpa_offset is greater than or equal to the size of the file, the
status NFS4_OK is returned with di_length (the data length) set to status NFS4_OK is returned with di_length (the data length) set to
zero and eof set to TRUE. zero and eof set to TRUE.
The READ_PLUS result is comprised of an array of rpr_contents, each The READ_PLUS result is comprised of an array of rpr_contents, each
of which describes a data_content4 type of data. For NFSv4.2, the of which describes a data_content4 type of data. For NFSv4.2, the
allowed values are data and hole. A server MUST support both the allowed values are data and hole. A server MUST support both the
data type and the hole if it uses READ_PLUS. If it does not want to data type and the hole if it uses READ_PLUS. If it does not want to
support a hole, it MUST use READ. The hole SHOULD be returned in its support a hole, it MUST use READ. The array contents MUST be
entirety - clients must be prepared to get more information than they contiguous in the file.
requested. Both the start and the end of the hole may exceed what
was requested. The array contents MUST be contiguous in the file.
If the data to be returned is comprised entirely of zeros, then the Holes SHOULD be returned in their entirety - clients must be prepared
server SHOULD return that data as a hole. The di_reserved field is to get more information than they requested. Both the start and the
used to tell the client if a hole is unreserved, that is writes to it end of the hole may exceed what was requested. If data to be
MAY return NFS4ERR_NOSPC, or it is reserved in which cases writes returned is comprised entirely of zeros, then the server SHOULD
into the hole MUST NOT return ENOSPC. If the server does not know return that data as a hole instead.
the reservations status it may set the di_reserved field to
SPACE_UNKNOWN4.
The server may elect to return adjacent elements of the same type. The server may elect to return adjacent elements of the same type.
For example, if the server has a range of data comprised entirely of For example, if the server has a range of data comprised entirely of
zeros and then a hole, it might want to return two adjacent holes to zeros and then a hole, it might want to return two adjacent holes to
the client. the client.
If the client specifies a rpa_count value of zero, the READ_PLUS If the client specifies a rpa_count value of zero, the READ_PLUS
succeeds and returns zero bytes of data. In all situations, the succeeds and returns zero bytes of data. In all situations, the
server may choose to return fewer bytes than specified by the client. server may choose to return fewer bytes than specified by the client.
The client needs to check for this condition and handle the condition The client needs to check for this condition and handle the condition
skipping to change at page 83, line 16 skipping to change at page 86, line 22
For a READ_PLUS with a stateid value of all bits equal to zero, the For a READ_PLUS with a stateid value of all bits equal to zero, the
server MAY allow the READ_PLUS to be serviced subject to mandatory server MAY allow the READ_PLUS to be serviced subject to mandatory
byte-range locks or the current share deny modes for the file. For a byte-range locks or the current share deny modes for the file. For a
READ_PLUS with a stateid value of all bits equal to one, the server READ_PLUS with a stateid value of all bits equal to one, the server
MAY allow READ_PLUS operations to bypass locking checks at the MAY allow READ_PLUS operations to bypass locking checks at the
server. server.
On success, the current filehandle retains its value. On success, the current filehandle retains its value.
15.11.3.1. Note on Client Support of Arms of the Union 16.10.3.1. Note on Client Support of Arms of the Union
It was decided not to add a means for the client to inform the server It was decided not to add a means for the client to inform the server
as to which arms of READ_PLUS it would support. In a later minor as to which arms of READ_PLUS it would support. In a later minor
version, it may become necessary for the introduction of a new version, it may become necessary for the introduction of a new
operation which would allow the client to inform the server as to operation which would allow the client to inform the server as to
whether it supported the new arms of the union of data types whether it supported the new arms of the union of data types
available in READ_PLUS. available in READ_PLUS.
15.11.4. IMPLEMENTATION 16.10.4. IMPLEMENTATION
In general, the IMPLEMENTATION notes for READ in Section 18.22.4 of In general, the IMPLEMENTATION notes for READ in Section 18.22.4 of
[RFC5661] also apply to READ_PLUS. [RFC5661] also apply to READ_PLUS.
15.11.4.1. Additional pNFS Implementation Information 16.10.4.1. Additional pNFS Implementation Information
With pNFS, the semantics of using READ_PLUS remains the same. Any With pNFS, the semantics of using READ_PLUS remains the same. Any
data server MAY return a hole result for a READ_PLUS request that it data server MAY return a hole result for a READ_PLUS request that it
receives. When a data server chooses to return such a result, it has receives. When a data server chooses to return such a result, it has
the option of returning information for the data stored on that data the option of returning information for the data stored on that data
server (as defined by the data layout), but it MUST NOT return server (as defined by the data layout), but it MUST NOT return
results for a byte range that includes data managed by another data results for a byte range that includes data managed by another data
server. server.
If mandatory locking is enforced, then the data server must also If mandatory locking is enforced, then the data server must also
ensure that it returns only information that is within the owner's ensure that it returns only information that is within the owner's
locked byte range. locked byte range.
15.11.5. READ_PLUS with Sparse Files Example 16.10.5. READ_PLUS with Sparse Files Example
The following table describes a sparse file. For each byte range, The following table describes a sparse file. For each byte range,
the file contains either non-zero data or a hole. In addition, the the file contains either non-zero data or a hole. In addition, the
server in this example will only create a hole if it is greater than server in this example will only create a hole if it is greater than
32K. 32K.
+-------------+----------+ +-------------+----------+
| Byte-Range | Contents | | Byte-Range | Contents |
+-------------+----------+ +-------------+----------+
| 0-15999 | Hole | | 0-15999 | Hole |
skipping to change at page 84, line 45 skipping to change at page 88, line 5
the client was requesting. the client was requesting.
3. READ_PLUS(s, 256K, 64K) --> NFS4_OK, eof = false, <data[256K, 3. READ_PLUS(s, 256K, 64K) --> NFS4_OK, eof = false, <data[256K,
288K], hole[288K, 354K]>. Returns an array of the 32K data and 288K], hole[288K, 354K]>. Returns an array of the 32K data and
the hole which extends to 354K. the hole which extends to 354K.
4. READ_PLUS(s, 354K, 64K) --> NFS4_OK, eof = true, <data[354K, 4. READ_PLUS(s, 354K, 64K) --> NFS4_OK, eof = true, <data[354K,
418K]>. Returns the final 64K of data and informs the client 418K]>. Returns the final 64K of data and informs the client
there is no more data in the file. there is no more data in the file.
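The way a server might assemble such replies can be sketched as
follows. This is only an illustration under stated assumptions: the
extent map FILE is hypothetical (chosen to be consistent with steps 3
and 4 above, not a reproduction of the table), K is taken to be 1000,
and read_plus() is an invented helper that returns data clipped to
the request and holes in their entirety.

```python
K = 1000  # assumption: the example's "K" is decimal

# Hypothetical extent map: (start, end, kind), end exclusive.
FILE = [
    (0, 256 * K, "hole"),
    (256 * K, 288 * K, "data"),
    (288 * K, 354 * K, "hole"),
    (354 * K, 418 * K, "data"),
]


def read_plus(extents, offset, count):
    """Return (eof, segments) for a READ_PLUS-like request."""
    size = extents[-1][1] if extents else 0
    if offset >= size:
        return True, []   # at or past end of file: nothing to return
    if count == 0:
        return False, []  # zero-length read succeeds with no content
    end = min(offset + count, size)
    segments = []
    for start, stop, kind in extents:
        if stop <= offset or start >= end:
            continue      # extent does not overlap the request
        if kind == "hole":
            # Holes are returned whole and may exceed the request.
            segments.append(("hole", start, stop))
        else:
            segments.append(("data", max(start, offset), min(stop, end)))
    eof = bool(segments) and segments[-1][2] >= size
    return eof, segments
```

With this map, read_plus(FILE, 256 * K, 64 * K) yields the data
segment followed by the whole hole ending at 354K, and
read_plus(FILE, 354 * K, 64 * K) yields the final data segment with
eof set.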
15.12. Operation 69: SEEK - Find the Next Data or Hole 16.11. Operation 69: SEEK - Find the Next Data or Hole
16.11.1. ARGUMENT
enum data_content4 {
NFS4_CONTENT_DATA = 0,
NFS4_CONTENT_HOLE = 1
};
15.12.1. ARGUMENT
struct SEEK4args { struct SEEK4args {
/* CURRENT_FH: file */ /* CURRENT_FH: file */
stateid4 sa_stateid; stateid4 sa_stateid;
offset4 sa_offset; offset4 sa_offset;
data_content4 sa_what; data_content4 sa_what;
}; };
15.12.2. RESULT 16.11.2. RESULT
struct seek_res4 { struct seek_res4 {
bool sr_eof; bool sr_eof;
offset4 sr_offset; offset4 sr_offset;
}; };
union SEEK4res switch (nfsstat4 sa_status) { union SEEK4res switch (nfsstat4 sa_status) {
case NFS4_OK: case NFS4_OK:
seek_res4 resok4; seek_res4 resok4;
default: default:
void; void;
}; };
15.12.3. DESCRIPTION 16.11.3. DESCRIPTION
SEEK is an operation that allows a client to determine the location SEEK is an operation that allows a client to determine the location
of the next data_content4 in a file. It allows an implementation of of the next data_content4 in a file. It allows an implementation of
the emerging extension to lseek(2) to allow clients to determine the the emerging extension to lseek(2) to allow clients to determine the
next hole whilst in data or the next data whilst in a hole. next hole whilst in data or the next data whilst in a hole.
From the given sa_offset, find the next data_content4 of type sa_what From the given sa_offset, find the next data_content4 of type sa_what
in the file. If the server cannot find a corresponding sa_what, in the file. If the server cannot find a corresponding sa_what,
then the status will still be NFS4_OK, but sr_eof would be TRUE. If then the status will still be NFS4_OK, but sr_eof would be TRUE. If
the server can find the sa_what, then the sr_offset is the start of the server can find the sa_what, then the sr_offset is the start of
that content. that content. If the sa_offset is beyond the end of the file, then
SEEK MUST return NFS4ERR_NXIO.
All files MUST have a virtual hole at the end of the file. I.e., if
a filesystem does not support sparse files, then a compound with
{SEEK 0 NFS4_CONTENT_HOLE;} would return a result of {SEEK 1 X;}
where 'X' was the size of the file.
SEEK must follow the same rules for stateids as READ_PLUS SEEK must follow the same rules for stateids as READ_PLUS
(Section 15.11.3). (Section 16.10.3).
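The SEEK rules, including the NFS4ERR_NXIO and virtual-hole text that
appears only in the -27 column, can be sketched as follows. The
helper seek() and its extent-map argument are hypothetical. Two
points are assumptions of this sketch rather than statements of the
draft: "beyond the end of the file" is read as sa_offset greater than
the file size, and sr_offset is set to the file size whenever sr_eof
is TRUE.

```python
NFS4_OK = 0       # nfsstat4 value from RFC 5661
NFS4ERR_NXIO = 6  # nfsstat4 value from RFC 5661

NFS4_CONTENT_DATA = 0
NFS4_CONTENT_HOLE = 1


def seek(extents, size, sa_offset, sa_what):
    """Return (status, sr_eof, sr_offset) for a SEEK-like request.

    extents is a sorted list of (start, end, kind) with end exclusive.
    """
    if sa_offset > size:
        return NFS4ERR_NXIO, None, None
    for start, stop, kind in extents:
        if kind == sa_what and stop > sa_offset:
            # Content found; it starts no earlier than sa_offset.
            return NFS4_OK, False, max(start, sa_offset)
    # Nothing found: for holes this is the virtual hole at end of file.
    return NFS4_OK, True, size
```

For a file of size X with no real holes,
seek([(0, X, NFS4_CONTENT_DATA)], X, 0, NFS4_CONTENT_HOLE) returns
sr_eof TRUE and sr_offset X, matching the {SEEK 1 X;} result
described in the -27 text.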
15.13. Operation 70: WRITE_SAME - WRITE an ADB Multiple Times to a File 16.12. Operation 70: WRITE_SAME - WRITE an ADB Multiple Times to a File
15.13.1. ARGUMENT 16.12.1. ARGUMENT
enum stable_how4 { enum stable_how4 {
UNSTABLE4 = 0, UNSTABLE4 = 0,
DATA_SYNC4 = 1, DATA_SYNC4 = 1,
FILE_SYNC4 = 2 FILE_SYNC4 = 2
}; };
struct app_data_block4 { struct app_data_block4 {
offset4 adb_offset; offset4 adb_offset;
length4 adb_block_size; length4 adb_block_size;
length4 adb_block_count; length4 adb_block_count;
length4 adb_reloff_blocknum; length4 adb_reloff_blocknum;
count4 adb_block_num; count4 adb_block_num;
length4 adb_reloff_pattern; length4 adb_reloff_pattern;
opaque adb_pattern<>; opaque adb_pattern<>;
}; };
skipping to change at page 86, line 21 skipping to change at page 89, line 35
opaque adb_pattern<>; opaque adb_pattern<>;
}; };
struct WRITE_SAME4args { struct WRITE_SAME4args {
/* CURRENT_FH: file */ /* CURRENT_FH: file */
stateid4 wsa_stateid; stateid4 wsa_stateid;
stable_how4 wsa_stable; stable_how4 wsa_stable;
app_data_block4 wsa_adb; app_data_block4 wsa_adb;
}; };
15.13.2. RESULT 16.12.2. RESULT
struct write_response4 { struct write_response4 {
stateid4 wr_callback_id<1>; stateid4 wr_callback_id<1>;
length4 wr_count; length4 wr_count;
stable_how4 wr_committed; stable_how4 wr_committed;
verifier4 wr_writeverf; verifier4 wr_writeverf;
}; };
union WRITE_SAME4res switch (nfsstat4 wsr_status) { union WRITE_SAME4res switch (nfsstat4 wsr_status) {
case NFS4_OK: case NFS4_OK:
write_response4 resok4; write_response4 resok4;
default: default:
void; void;
}; };
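As an illustration of how the resok4 arm might look on the wire, the following sketch decodes a write_response4 from XDR bytes. It assumes the usual NFSv4.1 encodings (stateid4 as a 32-bit seqid followed by 12 opaque bytes, length4 as an unsigned 64-bit integer, verifier4 as 8 opaque bytes); the names WriteResponse4 and decode_write_response4 are hypothetical.

```python
import struct
from dataclasses import dataclass
from typing import Optional, Tuple

NFS4_OTHER_SIZE = 12     # opaque part of stateid4 (assumed per NFSv4.1)
NFS4_VERIFIER_SIZE = 8   # size of verifier4 (assumed per NFSv4.1)


@dataclass
class WriteResponse4:
    wr_callback_id: Optional[Tuple[int, bytes]]  # (seqid, other) or None
    wr_count: int
    wr_committed: int
    wr_writeverf: bytes


def decode_write_response4(buf: bytes) -> WriteResponse4:
    """Decode a write_response4; hypothetical helper, big-endian XDR."""
    pos = 0
    # wr_callback_id<1> is a variable-length array: count, then elements
    (count,) = struct.unpack_from(">I", buf, pos)
    pos += 4
    if count > 1:
        raise ValueError("wr_callback_id holds at most one stateid")
    callback_id = None
    if count == 1:
        seqid, other = struct.unpack_from(">I12s", buf, pos)
        pos += 4 + NFS4_OTHER_SIZE
        callback_id = (seqid, other)
    # length4 wr_count (64 bits) followed by the stable_how4 enum (32 bits)
    wr_count, wr_committed = struct.unpack_from(">Qi", buf, pos)
    pos += 12
    (verf,) = struct.unpack_from(">8s", buf, pos)
    return WriteResponse4(callback_id, wr_count, wr_committed, verf)
```

An empty wr_callback_id array decodes to None here, which a caller could treat as the synchronous case.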
15.13.3. DESCRIPTION 16.12.3. DESCRIPTION
The WRITE_SAME operation writes an application data block to the The WRITE_SAME operation writes an application data block to the
regular file identified by the current filehandle (see WRITE SAME regular file identified by the current filehandle (see WRITE SAME
(10) in [T10-SBC2]). The target file is specified by the current (10) in [T10-SBC2]). The target file is specified by the current
filehandle. The data to be written is specified by an filehandle. The data to be written is specified by an
app_data_block4 structure (Section 8.1.1). The client specifies with app_data_block4 structure (Section 8.1.1). The client specifies with
the wsa_stable parameter the method of how the data is to be the wsa_stable parameter the method of how the data is to be
processed by the server. It is treated like the stable parameter in processed by the server. It is treated like the stable parameter in
the NFSv4.1 WRITE operation (see Section 18.2 of [RFC5661]). the NFSv4.1 WRITE operation (see Section 18.2 of [RFC5661]).
A successful WRITE_SAME will construct a reply for wr_count, A successful WRITE_SAME will construct a reply for wr_count,
wr_committed, and wr_writeverf as per the NFSv4.1 WRITE operation wr_committed, and wr_writeverf as per the NFSv4.1 WRITE operation
results. If wr_callback_id is set, it indicates an asynchronous results. If wr_callback_id is set, it indicates an asynchronous
reply (see Section 15.13.3.1). reply (see Section 16.12.3.1).
WRITE_SAME has to support all of the errors which are returned by WRITE_SAME has to support all of the errors which are returned by
WRITE plus NFS4ERR_NOTSUPP, i.e., it is an OPTIONAL operation. If WRITE plus NFS4ERR_NOTSUPP, i.e., it is an OPTIONAL operation. If
the client supports WRITE_SAME, it MUST support CB_OFFLOAD. the client supports WRITE_SAME, it MUST support CB_OFFLOAD.
If the server supports ADBs, then it MUST support the WRITE_SAME If the server supports ADBs, then it MUST support the WRITE_SAME
operation. The server has no concept of the structure imposed by the operation. The server has no concept of the structure imposed by the
application. It is only when the application writes to a section of application. It is only when the application writes to a section of
the file that order gets imposed. In order to detect corruption even the file that order gets imposed. In order to detect corruption even
before the application utilizes the file, the application will want before the application utilizes the file, the application will want
skipping to change at page 87, line 33 skipping to change at page 90, line 50
When the server receives the WRITE_SAME operation, it MUST populate When the server receives the WRITE_SAME operation, it MUST populate
adb_block_count ADBs in the file starting at adb_offset. The block adb_block_count ADBs in the file starting at adb_offset. The block
size will be given by adb_block_size. The ADBN (if provided) will size will be given by adb_block_size. The ADBN (if provided) will
start at adb_reloff_blocknum and each block will be monotonically start at adb_reloff_blocknum and each block will be monotonically
numbered starting from adb_block_num in the first block. The pattern numbered starting from adb_block_num in the first block. The pattern
(if provided) will be at adb_reloff_pattern of each block and will be (if provided) will be at adb_reloff_pattern of each block and will be
provided in adb_pattern. provided in adb_pattern.
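The population rule above can be sketched as a function that instantiates the blocks a server would have to make visible. This is a minimal sketch under stated assumptions: the ADBN is written as a 4-byte big-endian integer, bytes not covered by the ADBN or the pattern are zero, and block_num set to None stands for "ADBN not provided". None of these choices is stated in the text above, and the function name instantiate_adbs is hypothetical.

```python
import struct


def instantiate_adbs(adb_offset, adb_block_size, adb_block_count,
                     adb_reloff_blocknum, adb_block_num,
                     adb_reloff_pattern, adb_pattern):
    """Return a list of (file_offset, block_bytes) pairs.

    Hypothetical helper; adb_block_num=None means no ADBN is written,
    and an empty adb_pattern means no pattern is written.
    """
    if adb_reloff_pattern + len(adb_pattern) > adb_block_size:
        raise ValueError("pattern does not fit inside one block")
    result = []
    for i in range(adb_block_count):
        block = bytearray(adb_block_size)  # assumption: zero fill
        if adb_block_num is not None:
            # Blocks are numbered monotonically from adb_block_num
            struct.pack_into(">I", block, adb_reloff_blocknum,
                             adb_block_num + i)
        end = adb_reloff_pattern + len(adb_pattern)
        block[adb_reloff_pattern:end] = adb_pattern
        # Blocks are laid out consecutively starting at adb_offset
        result.append((adb_offset + i * adb_block_size, bytes(block)))
    return result
```

A client could compare such instantiated blocks against data read back from the file, for example when checking how far a partially complete WRITE_SAME progressed.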
The server SHOULD return an asynchronous result if it can determine The server SHOULD return an asynchronous result if it can determine
the operation will be long running (see Section 15.13.3.1). Once the operation will be long running (see Section 16.12.3.1). Once
either the WRITE_SAME finishes synchronously or the server uses either the WRITE_SAME finishes synchronously or the server uses
CB_OFFLOAD to inform the client of the asynchronous completion of the CB_OFFLOAD to inform the client of the asynchronous completion of the
WRITE_SAME, the server MUST return the ADBs to clients as data. WRITE_SAME, the server MUST return the ADBs to clients as data.
15.13.3.1. Asynchronous Transactions 16.12.3.1. Asynchronous Transactions
ADB initialization may lead to the server determining to service the ADB initialization may lead to the server determining to service the
operation asynchronously. If it decides to do so, it sets the operation asynchronously. If it decides to do so, it sets the
stateid in wr_callback_id to be that of the wsa_stateid. If it does stateid in wr_callback_id to be that of the wsa_stateid. If it does
not set the wr_callback_id, then the result is synchronous. not set the wr_callback_id, then the result is synchronous.
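A client might distinguish the two cases and remember outstanding asynchronous operations roughly as follows. This is a sketch only: the class PendingOffloads, its method names, and the representation of a stateid as a (seqid, other) tuple are all hypothetical.

```python
from typing import Dict, Sequence, Tuple

Stateid = Tuple[int, bytes]  # hypothetical (seqid, other) representation


class PendingOffloads:
    """Tracks WRITE_SAME operations the server chose to run asynchronously."""

    def __init__(self) -> None:
        self._pending: Dict[Stateid, str] = {}

    def note_reply(self, wr_callback_id: Sequence[Stateid],
                   description: str) -> bool:
        """Record a reply; return True if it was asynchronous."""
        if len(wr_callback_id) > 1:
            raise ValueError("wr_callback_id holds at most one stateid")
        if not wr_callback_id:
            # wr_callback_id not set: the result is synchronous
            return False
        # Set: the client must wait to be told the operation is complete
        self._pending[wr_callback_id[0]] = description
        return True

    def is_pending(self, stateid: Stateid) -> bool:
        return stateid in self._pending

    def complete(self, stateid: Stateid) -> str:
        """Forget an operation once the server reports completion."""
        return self._pending.pop(stateid)
```

While an entry is pending, the client should not assume anything about the contents of the range it asked the server to write.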
When the client determines that the reply will be given When the client determines that the reply will be given
asynchronously, it should not assume anything about the contents of asynchronously, it should not assume anything about the contents of
what it wrote until it is informed by the server that the operation what it wrote until it is informed by the server that the operation
is complete. It can use OFFLOAD_STATUS (Section 15.10) to monitor is complete. It can use OFFLOAD_STATUS (Section 16.9) to monitor the
the operation and OFFLOAD_CANCEL (Section 15.9) to cancel the operation and OFFLOAD_CANCEL (Section 16.8) to cancel the operation.
operation. An example of an asynchronous WRITE_SAME is shown in An example of an asynchronous WRITE_SAME is shown in Figure 6. Note
Figure 6. Note that as with the COPY operation, WRITE_SAME must that as with the COPY operation, WRITE_SAME must provide a stateid
provide a stateid for tracking the asynchronous operation. for tracking the asynchronous operation.
Client Server Client Server
+ + + +
| | | |
|--- OPEN ---------------------------->| Client opens |--- OPEN ---------------------------->| Client opens
|<------------------------------------/| the file |<------------------------------------/| the file
| | | |
|--- WRITE_SAME ----------------------->| Client initializes |--- WRITE_SAME ----------------------->| Client initializes
|<------------------------------------/| an ADB |<------------------------------------/| an ADB
| | | |
skipping to change at page 88, line 44 skipping to change at page 92, line 17
information that a synchronous WRITE_SAME would have provided. information that a synchronous WRITE_SAME would have provided.
Regardless of whether the operation is asynchronous or synchronous, Regardless of whether the operation is asynchronous or synchronous,
it MUST still support the COMMIT operation semantics as outlined in it MUST still support the COMMIT operation semantics as outlined in
Section 18.3 of [RFC5661]. I.e., COMMIT works on one or more WRITE Section 18.3 of [RFC5661]. I.e., COMMIT works on one or more WRITE
operations and the WRITE_SAME operation can appear as several WRITE operations and the WRITE_SAME operation can appear as several WRITE
operations to the server. The client can use locking operations to operations to the server. The client can use locking operations to
control the behavior on the server with respect to long running control the behavior on the server with respect to long running
asynchronous write operations. asynchronous write operations.
15.13.3.2. Error Handling of a Partially Complete WRITE_SAME 16.12.3.2. Error Handling of a Partially Complete WRITE_SAME
WRITE_SAME will clone adb_block_count copies of the given ADB in WRITE_SAME will clone adb_block_count copies of the given ADB in
consecutive order in the file starting at adb_offset. An error can consecutive order in the file starting at adb_offset. An error can
occur after writing the Nth ADB to the file. WRITE_SAME MUST appear occur after writing the Nth ADB to the file. WRITE_SAME MUST appear
to populate the range of the file as if the client used WRITE to to populate the range of the file as if the client used WRITE to
transfer the instantiated ADBs. I.e., the contents of the range will transfer the instantiated ADBs. I.e., the contents of the range will
be easy for the client to determine in case of a partially complete be easy for the client to determine in case of a partially complete
WRITE_SAME. WRITE_SAME.
16. NFSv4.2 Callback Operations 17. NFSv4.2 Callback Operations
16.1. Operation 15: CB_OFFLOAD - Report results of an asynchronous 17.1. Operation 15: CB_OFFLOAD - Report results of an asynchronous
operation operation
16.1.1. ARGUMENT 17.1.1. ARGUMENT
struct write_response4 { struct write_response4 {
stateid4 wr_callback_id<1>; stateid4 wr_callback_id<1>;
length4 wr_count; length4 wr_count;
stable_how4 wr_committed; stable_how4 wr_committed;
verifier4 wr_writeverf; verifier4 wr_writeverf;
}; };
union offload_info4 switch (nfsstat4 coa_status) { union offload_info4 switch (nfsstat4 coa_status) {
case NFS4_OK: case NFS4_OK:
write_response4 coa_resok4; write_response4 coa_resok4;
default: default:
length4 coa_bytes_copied; length4 coa_bytes_copied;
}; };
struct CB_OFFLOAD4args { struct CB_OFFLOAD4args {
nfs_fh4 coa_fh; nfs_fh4 coa_fh;
stateid4 coa_stateid; stateid4 coa_stateid;
offload_info4 coa_offload_info; offload_info4 coa_offload_info;
}; };
16.1.2. RESULT 17.1.2. RESULT
struct CB_OFFLOAD4res { struct CB_OFFLOAD4res {
nfsstat4 cor_status; nfsstat4 cor_status;
}; };
16.1.3. DESCRIPTION 17.1.3. DESCRIPTION
CB_OFFLOAD is used to report to the client the results of an CB_OFFLOAD is used to report to the client the results of an
asynchronous operation, e.g., Server Side Copy or WRITE_SAME. The asynchronous operation, e.g., Server Side Copy or WRITE_SAME. The
coa_fh and coa_stateid identify the transaction and the coa_status coa_fh and coa_stateid identify the transaction and the coa_status
indicates success or failure. The coa_resok4.wr_callback_id MUST NOT indicates success or failure. The coa_resok4.wr_callback_id MUST NOT
be set. If the transaction failed, then the coa_bytes_copied be set. If the transaction failed, then the coa_bytes_copied
contains the number of bytes copied before the failure occurred. The contains the number of bytes copied before the failure occurred. The
coa_bytes_copied value indicates the number of bytes copied but not coa_bytes_copied value indicates the number of bytes copied but not
which specific bytes have been copied. which specific bytes have been copied.
skipping to change at page 90, line 16 skipping to change at page 94, line 4
then the client is REQUIRED to support the CB_OFFLOAD operation. then the client is REQUIRED to support the CB_OFFLOAD operation.
There is a potential race between the reply to the original There is a potential race between the reply to the original
transaction on the forechannel and the CB_OFFLOAD callback on the transaction on the forechannel and the CB_OFFLOAD callback on the
backchannel. Sections 2.10.6.3 and 20.9.3 of [RFC5661] describe how backchannel. Sections 2.10.6.3 and 20.9.3 of [RFC5661] describe how
to handle this type of issue. to handle this type of issue.
Upon success, the coa_resok4.wr_count presents for each operation: Upon success, the coa_resok4.wr_count presents for each operation:
COPY: the total number of bytes copied COPY: the total number of bytes copied
WRITE_SAME: the same information that a synchronous WRITE_SAME would WRITE_SAME: the same information that a synchronous WRITE_SAME would
provide provide
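How a client might interpret the offload_info4 union in a CB_OFFLOAD can be sketched as below. The function name summarize_cb_offload and the use of a plain dict for the write_response4 fields are hypothetical, and rejecting a callback that carries wr_callback_id is one possible client reaction, not something the text above mandates.

```python
NFS4_OK = 0
NFS4ERR_IO = 5  # example failure status, used only for illustration


def summarize_cb_offload(coa_status, payload):
    """Interpret offload_info4.

    payload is a dict of write_response4 fields when coa_status is
    NFS4_OK, otherwise the integer coa_bytes_copied. Hypothetical helper.
    """
    if coa_status == NFS4_OK:
        if payload.get("wr_callback_id"):
            # coa_resok4.wr_callback_id MUST NOT be set
            raise ValueError("wr_callback_id set in CB_OFFLOAD")
        return {
            "ok": True,
            "wr_count": payload["wr_count"],
            "wr_committed": payload["wr_committed"],
            "wr_writeverf": payload["wr_writeverf"],
        }
    # On failure only a byte count is known, not which bytes were copied
    return {"ok": False, "status": coa_status, "bytes_copied": payload}
```

On failure the caller learns how many bytes were copied but has to determine which ones by other means.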
17. IANA Considerations 18. IANA Considerations
The IANA Considerations for Labeled NFS are addressed in [Quigley14]. The IANA Considerations for Labeled NFS are addressed in [Quigley14].
18. References 19. References
18.1. Normative References 19.1. Normative References
[NFSv42xdr] [NFSv42xdr]
Haynes, T., "Network File System (NFS) Version 4 Minor Haynes, T., "Network File System (NFS) Version 4 Minor
Version 2 External Data Representation Standard (XDR) Version 2 External Data Representation Standard (XDR)
Description", April 2014. Description", September 2014.
[RFC3986] Berners-Lee, T., Fielding, R., and L. Masinter, "Uniform [RFC3986] Berners-Lee, T., Fielding, R., and L. Masinter, "Uniform
Resource Identifier (URI): Generic Syntax", STD 66, RFC Resource Identifier (URI): Generic Syntax", STD 66, RFC
3986, January 2005. 3986, January 2005.
[RFC5661] Shepler, S., Eisler, M., and D. Noveck, "Network File [RFC5661] Shepler, S., Eisler, M., and D. Noveck, "Network File
System (NFS) Version 4 Minor Version 1 Protocol", RFC System (NFS) Version 4 Minor Version 1 Protocol", RFC
5661, January 2010. 5661, January 2010.
[RFC5664] Halevy, B., Welch, B., and J. Zelenka, "Object-Based [RFC5664] Halevy, B., Welch, B., and J. Zelenka, "Object-Based
skipping to change at page 91, line 12 skipping to change at page 94, line 43
Interfaces of The Open Group Base Specifications Issue 6, Interfaces of The Open Group Base Specifications Issue 6,
IEEE Std 1003.1, 2004 Edition", 2004. IEEE Std 1003.1, 2004 Edition", 2004.
[posix_fallocate] [posix_fallocate]
The Open Group, "Section 'posix_fallocate()' of System The Open Group, "Section 'posix_fallocate()' of System
Interfaces of The Open Group Base Specifications Issue 6, Interfaces of The Open Group Base Specifications Issue 6,
IEEE Std 1003.1, 2004 Edition", 2004. IEEE Std 1003.1, 2004 Edition", 2004.
[rpcsec_gssv3] [rpcsec_gssv3]
Adamson, W. and N. Williams, "Remote Procedure Call (RPC) Adamson, W. and N. Williams, "Remote Procedure Call (RPC)
Security Version 3", October 2013. Security Version 3", July 2014.
18.2. Informative References 19.2. Informative References
[Ashdown08] [Ashdown08]
Ashdown, L., "Chapter 15, Validating Database Files and Ashdown, L., "Chapter 15, Validating Database Files and
Backups, of Oracle Database Backup and Recovery User's Backups, of Oracle Database Backup and Recovery User's
Guide 11g Release 1 (11.1)", August 2008. Guide 11g Release 1 (11.1)", August 2008.
[BL73] Bell, D. and L. LaPadula, "Secure Computer Systems:
Mathematical Foundations and Model", Technical Report
M74-244, The MITRE Corporation, Bedford, MA, May 1973.
[Baira08] Bairavasundaram, L., Goodson, G., Schroeder, B., Arpaci- [Baira08] Bairavasundaram, L., Goodson, G., Schroeder, B., Arpaci-
Dusseau, A., and R. Arpaci-Dusseau, "An Analysis of Data Dusseau, A., and R. Arpaci-Dusseau, "An Analysis of Data
Corruption in the Storage Stack", Proceedings of the 6th Corruption in the Storage Stack", Proceedings of the 6th
USENIX Symposium on File and Storage Technologies (FAST USENIX Symposium on File and Storage Technologies (FAST
'08) , 2008. '08) , 2008.
[I-D.ietf-nfsv4-rfc3530bis] [I-D.ietf-nfsv4-rfc3530bis]
Haynes, T. and D. Noveck, "Network File System (NFS) Haynes, T. and D. Noveck, "Network File System (NFS)
version 4 Protocol", draft-ietf-nfsv4-rfc3530bis-25 (Work version 4 Protocol", draft-ietf-nfsv4-rfc3530bis-33 (Work
In Progress), February 2013. In Progress), April 2014.
[IESG08] IESG, "IESG Processing of RFC Errata for the IETF Stream", [IESG08] IESG, "IESG Processing of RFC Errata for the IETF Stream",
2008. 2008.
[MLS] "Section 46.6. Multi-Level Security (MLS) of Deployment
Guide: Deployment, configuration and administration of Red
Hat Enterprise Linux 5, Edition 6", 2011.
[McDougall07] [McDougall07]
McDougall, R. and J. Mauro, "Section 11.4.3, Detecting McDougall, R. and J. Mauro, "Section 11.4.3, Detecting
Memory Corruption of Solaris Internals", 2007. Memory Corruption of Solaris Internals", 2007.
[Quigley14] [Quigley14]
Quigley, D., Lu, J., and T. Haynes, "Registry Quigley, D., Lu, J., and T. Haynes, "Registry
Specification for Mandatory Access Control (MAC) Security Specification for Mandatory Access Control (MAC) Security
Label Formats", draft-ietf-nfsv4-lfs-registry-00 (work in Label Formats", draft-ietf-nfsv4-lfs-registry-01 (work in
progress), 2014. progress), September 2014.
[RFC1108] Kent, S., "Security Options for the Internet Protocol",
RFC 1108, November 1991.
[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
Requirement Levels", March 1997. Requirement Levels", March 1997.
[RFC2401] Kent, S. and R. Atkinson, "Security Architecture for the
Internet Protocol", RFC 2401, November 1998.
[RFC2616] Fielding, R., Gettys, J., Mogul, J., Frystyk, H., [RFC2616] Fielding, R., Gettys, J., Mogul, J., Frystyk, H.,
Masinter, L., Leach, P., and T. Berners-Lee, "Hypertext Masinter, L., Leach, P., and T. Berners-Lee, "Hypertext
Transfer Protocol -- HTTP/1.1", RFC 2616, June 1999. Transfer Protocol -- HTTP/1.1", RFC 2616, June 1999.
[RFC4506] Eisler, M., "XDR: External Data Representation Standard", [RFC4506] Eisler, M., "XDR: External Data Representation Standard",
RFC 4506, May 2006. RFC 4506, May 2006.
[RFC5663] Black, D., Fridella, S., and J. Glasgow, "Parallel NFS [RFC5663] Black, D., Fridella, S., and J. Glasgow, "Parallel NFS
(pNFS) Block/Volume Layout", RFC 5663, January 2010. (pNFS) Block/Volume Layout", RFC 5663, January 2010.
skipping to change at page 93, line 27 skipping to change at page 97, line 19
For Labeled NFS, the original draft was by David Quigley, James For Labeled NFS, the original draft was by David Quigley, James
Morris, Jarret Lu, and Tom Haynes. Peter Staubach, Trond Myklebust, Morris, Jarret Lu, and Tom Haynes. Peter Staubach, Trond Myklebust,
Stephen Smalley, Sorin Faibish, Nico Williams, and David Black also Stephen Smalley, Sorin Faibish, Nico Williams, and David Black also
contributed in the final push to get this accepted. contributed in the final push to get this accepted.
Christoph Hellwig was very helpful in getting the WRITE_SAME Christoph Hellwig was very helpful in getting the WRITE_SAME
semantics to model more of what T10 was doing for WRITE SAME (10) semantics to model more of what T10 was doing for WRITE SAME (10)
[T10-SBC2]. And he led the push to get space reservations to more [T10-SBC2]. And he led the push to get space reservations to more
closely model the posix_fallocate. closely model the posix_fallocate.
Andy Adamson picked up the RPCSEC_GSSv3 work, which enabled both
Labeled NFS and Server Side Copy to present more secure options.
Christoph Hellwig provided the update to GETDEVICELIST.
During the review process, Talia Reyes-Ortiz helped the sessions run During the review process, Talia Reyes-Ortiz helped the sessions run
smoothly. While many people contributed here and there, the core smoothly. While many people contributed here and there, the core
reviewers were Andy Adamson, Pranoop Erasani, Bruce Fields, Chuck reviewers were Andy Adamson, Pranoop Erasani, Bruce Fields, Chuck
Lever, Trond Myklebust, David Noveck, Peter Staubach, and Mike Lever, Trond Myklebust, David Noveck, Peter Staubach, and Mike
Kupfer. Kupfer.
Appendix B. RFC Editor Notes Appendix B. RFC Editor Notes
[RFC Editor: please remove this section prior to publishing this [RFC Editor: please remove this section prior to publishing this
document as an RFC] document as an RFC]
 End of changes. 147 change blocks. 
352 lines changed or deleted 464 lines changed or added
