draft-ietf-nfsv4-minorversion2-05.txt   draft-ietf-nfsv4-minorversion2-06.txt 
NFSv4 T. Haynes NFSv4 T. Haynes
Internet-Draft Editor Internet-Draft Editor
Intended status: Standards Track September 06, 2011 Intended status: Standards Track November 14, 2011
Expires: March 9, 2012 Expires: May 17, 2012
NFS Version 4 Minor Version 2 NFS Version 4 Minor Version 2
draft-ietf-nfsv4-minorversion2-05.txt draft-ietf-nfsv4-minorversion2-06.txt
Abstract Abstract
This Internet-Draft describes NFS version 4 minor version two, This Internet-Draft describes NFS version 4 minor version two,
focusing mainly on the protocol extensions made from NFS version 4 focusing mainly on the protocol extensions made from NFS version 4
minor version 0 and NFS version 4 minor version 1. Major extensions minor version 0 and NFS version 4 minor version 1. Major extensions
introduced in NFS version 4 minor version two include: Server-side introduced in NFS version 4 minor version two include: Server-side
Copy, Space Reservations, and Support for Sparse Files. Copy, Space Reservations, and Support for Sparse Files.
Requirements Language Requirements Language
skipping to change at page 1, line 40 skipping to change at page 1, line 40
Internet-Drafts are working documents of the Internet Engineering Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF). Note that other groups may also distribute Task Force (IETF). Note that other groups may also distribute
working documents as Internet-Drafts. The list of current Internet- working documents as Internet-Drafts. The list of current Internet-
Drafts is at http://datatracker.ietf.org/drafts/current/. Drafts is at http://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress." material or to cite them other than as "work in progress."
This Internet-Draft will expire on March 9, 2012. This Internet-Draft will expire on May 17, 2012.
Copyright Notice Copyright Notice
Copyright (c) 2011 IETF Trust and the persons identified as the Copyright (c) 2011 IETF Trust and the persons identified as the
document authors. All rights reserved. document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents Provisions Relating to IETF Documents
(http://trustee.ietf.org/license-info) in effect on the date of (http://trustee.ietf.org/license-info) in effect on the date of
publication of this document. Please review these documents publication of this document. Please review these documents
skipping to change at page 3, line 8 skipping to change at page 3, line 8
Without obtaining an adequate license from the person(s) controlling Without obtaining an adequate license from the person(s) controlling
the copyright in such materials, this document may not be modified the copyright in such materials, this document may not be modified
outside the IETF Standards Process, and derivative works of it may outside the IETF Standards Process, and derivative works of it may
not be created outside the IETF Standards Process, except to format not be created outside the IETF Standards Process, except to format
it for publication as an RFC or to translate it into languages other it for publication as an RFC or to translate it into languages other
than English. than English.
Table of Contents Table of Contents
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 6 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.1. The NFS Version 4 Minor Version 2 Protocol . . . . . . . . 6 1.1. The NFS Version 4 Minor Version 2 Protocol . . . . . . . 6
1.2. Scope of This Document . . . . . . . . . . . . . . . . . . 6 1.2. Scope of This Document . . . . . . . . . . . . . . . . . 6
1.3. NFSv4.2 Goals . . . . . . . . . . . . . . . . . . . . . . 6 1.3. NFSv4.2 Goals . . . . . . . . . . . . . . . . . . . . . . 6
1.4. Overview of NFSv4.2 Features . . . . . . . . . . . . . . . 6 1.4. Overview of NFSv4.2 Features . . . . . . . . . . . . . . 6
1.5. Differences from NFSv4.1 . . . . . . . . . . . . . . . . . 6 1.4.1. Application I/O Advise . . . . . . . . . . . . . . . . 6
2. NFS Server-side Copy . . . . . . . . . . . . . . . . . . . . . 6 1.5. Differences from NFSv4.1 . . . . . . . . . . . . . . . . 7
2.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . 7 2. NFS Server-side Copy . . . . . . . . . . . . . . . . . . . . . 7
2.2. Protocol Overview . . . . . . . . . . . . . . . . . . . . 7 2.1. Introduction . . . . . . . . . . . . . . . . . . . . . . 7
2.2. Protocol Overview . . . . . . . . . . . . . . . . . . . . 7
2.2.1. Intra-Server Copy . . . . . . . . . . . . . . . . . . 9 2.2.1. Intra-Server Copy . . . . . . . . . . . . . . . . . . 9
2.2.2. Inter-Server Copy . . . . . . . . . . . . . . . . . . 10 2.2.2. Inter-Server Copy . . . . . . . . . . . . . . . . . . 10
2.2.3. Server-to-Server Copy Protocol . . . . . . . . . . . . 13 2.2.3. Server-to-Server Copy Protocol . . . . . . . . . . . . 13
2.3. Operations . . . . . . . . . . . . . . . . . . . . . . . . 15 2.3. Operations . . . . . . . . . . . . . . . . . . . . . . . 15
2.3.1. netloc4 - Network Locations . . . . . . . . . . . . . 15 2.3.1. netloc4 - Network Locations . . . . . . . . . . . . . 15
2.3.2. Copy Offload Stateids . . . . . . . . . . . . . . . . 16 2.3.2. Copy Offload Stateids . . . . . . . . . . . . . . . . 16
2.4. Security Considerations . . . . . . . . . . . . . . . . . 16 2.4. Security Considerations . . . . . . . . . . . . . . . . . 16
2.4.1. Inter-Server Copy Security . . . . . . . . . . . . . . 16 2.4.1. Inter-Server Copy Security . . . . . . . . . . . . . . 16
3. Sparse Files . . . . . . . . . . . . . . . . . . . . . . . . . 24 3. Sparse Files . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . 24 3.1. Introduction . . . . . . . . . . . . . . . . . . . . . . 24
3.2. Terminology . . . . . . . . . . . . . . . . . . . . . . . 25 3.2. Terminology . . . . . . . . . . . . . . . . . . . . . . . 25
3.3. Overview of Sparse Files and NFSv4 . . . . . . . . . . . . 25 3.3. Overview of Sparse Files and NFSv4 . . . . . . . . . . . 25
3.4. Operation 65: READ_PLUS . . . . . . . . . . . . . . . . . 26 3.4. Operation 65: READ_PLUS . . . . . . . . . . . . . . . . . 26
3.4.1. ARGUMENT . . . . . . . . . . . . . . . . . . . . . . . 26 3.4.1. ARGUMENT . . . . . . . . . . . . . . . . . . . . . . . 26
3.4.2. RESULT . . . . . . . . . . . . . . . . . . . . . . . . 27 3.4.2. RESULT . . . . . . . . . . . . . . . . . . . . . . . . 27
3.4.3. DESCRIPTION . . . . . . . . . . . . . . . . . . . . . 27 3.4.3. DESCRIPTION . . . . . . . . . . . . . . . . . . . . . 27
3.4.4. IMPLEMENTATION . . . . . . . . . . . . . . . . . . . . 29 3.4.4. IMPLEMENTATION . . . . . . . . . . . . . . . . . . . . 29
3.4.5. READ_PLUS with Sparse Files Example . . . . . . . . . 30 3.4.5. READ_PLUS with Sparse Files Example . . . . . . . . . 30
3.5. Related Work . . . . . . . . . . . . . . . . . . . . . . . 31 3.5. Related Work . . . . . . . . . . . . . . . . . . . . . . 31
3.6. Other Proposed Designs . . . . . . . . . . . . . . . . . . 31 3.6. Other Proposed Designs . . . . . . . . . . . . . . . . . 31
3.6.1. Multi-Data Server Hole Information . . . . . . . . . . 31 3.6.1. Multi-Data Server Hole Information . . . . . . . . . . 31
3.6.2. Data Result Array . . . . . . . . . . . . . . . . . . 32 3.6.2. Data Result Array . . . . . . . . . . . . . . . . . . 32
3.6.3. User-Defined Sparse Mask . . . . . . . . . . . . . . . 32 3.6.3. User-Defined Sparse Mask . . . . . . . . . . . . . . . 32
3.6.4. Allocated flag . . . . . . . . . . . . . . . . . . . . 32 3.6.4. Allocated flag . . . . . . . . . . . . . . . . . . . . 32
3.6.5. Dense and Sparse pNFS File Layouts . . . . . . . . . . 33 3.6.5. Dense and Sparse pNFS File Layouts . . . . . . . . . . 33
4. Space Reservation . . . . . . . . . . . . . . . . . . . . . . 33 4. Space Reservation . . . . . . . . . . . . . . . . . . . . . . 33
4.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . 33 4.1. Introduction . . . . . . . . . . . . . . . . . . . . . . 33
4.2. Operations and attributes . . . . . . . . . . . . . . . . 35 4.2. Operations and attributes . . . . . . . . . . . . . . . . 35
4.3. Attribute 77: space_reserved . . . . . . . . . . . . . . . 35 4.3. Attribute 77: space_reserved . . . . . . . . . . . . . . 35
4.4. Attribute 78: space_freed . . . . . . . . . . . . . . . . 36 4.4. Attribute 78: space_freed . . . . . . . . . . . . . . . . 36
4.5. Attribute 79: max_hole_punch . . . . . . . . . . . . . . . 36 5. Support for Application IO Hints . . . . . . . . . . . . . . . 36
5. Application Data Block Support . . . . . . . . . . . . . . . . 36 5.1. Introduction . . . . . . . . . . . . . . . . . . . . . . 36
5.1. Generic Framework . . . . . . . . . . . . . . . . . . . . 37 5.2. POSIX Requirements . . . . . . . . . . . . . . . . . . . 37
5.1.1. Data Block Representation . . . . . . . . . . . . . . 38 5.3. Additional Requirements . . . . . . . . . . . . . . . . . 38
5.1.2. Data Content . . . . . . . . . . . . . . . . . . . . . 38 5.4. Security Considerations . . . . . . . . . . . . . . . . . 39
5.2. pNFS Considerations . . . . . . . . . . . . . . . . . . . 38 5.5. IANA Considerations . . . . . . . . . . . . . . . . . . . 39
5.3. An Example of Detecting Corruption . . . . . . . . . . . . 39
5.4. Example of READ_PLUS . . . . . . . . . . . . . . . . . . . 40
5.5. Zero Filled Holes . . . . . . . . . . . . . . . . . . . . 41
6. Labeled NFS . . . . . . . . . . . . . . . . . . . . . . . . . 41
6.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . 41
6.2. Definitions . . . . . . . . . . . . . . . . . . . . . . . 42
6.3. MAC Security Attribute . . . . . . . . . . . . . . . . . . 43
6.3.1. Interpreting FATTR4_SEC_LABEL . . . . . . . . . . . . 44
6.3.2. Delegations . . . . . . . . . . . . . . . . . . . . . 44
6.3.3. Permission Checking . . . . . . . . . . . . . . . . . 45
6.3.4. Object Creation . . . . . . . . . . . . . . . . . . . 45
6.3.5. Existing Objects . . . . . . . . . . . . . . . . . . . 45
6.3.6. Label Changes . . . . . . . . . . . . . . . . . . . . 45
6.4. pNFS Considerations . . . . . . . . . . . . . . . . . . . 46
6.5. Discovery of Server LNFS Support . . . . . . . . . . . . . 47
6.6. MAC Security NFS Modes of Operation . . . . . . . . . . . 47
6.6.1. Full Mode . . . . . . . . . . . . . . . . . . . . . . 47
6.6.2. Smart Client Mode . . . . . . . . . . . . . . . . . . 49
6.6.3. Smart Server Mode . . . . . . . . . . . . . . . . . . 49
6.7. Security Considerations . . . . . . . . . . . . . . . . . 50
7. Sharing change attribute implementation details with NFSv4
clients . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
7.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . 51
7.2. Definition of the 'change_attr_type' per-file system
attribute . . . . . . . . . . . . . . . . . . . . . . . . 51
8. Security Considerations . . . . . . . . . . . . . . . . . . . 53
9. Operations: REQUIRED, RECOMMENDED, or OPTIONAL . . . . . . . . 53
10. NFSv4.2 Operations . . . . . . . . . . . . . . . . . . . . . . 56
10.1. Operation 59: COPY - Initiate a server-side copy . . . . . 56
10.2. Operation 60: COPY_ABORT - Cancel a server-side copy . . . 64
10.3. Operation 61: COPY_NOTIFY - Notify a source server of
a future copy . . . . . . . . . . . . . . . . . . . . . . 65
10.4. Operation 62: COPY_REVOKE - Revoke a destination
server's copy privileges . . . . . . . . . . . . . . . . . 68
10.5. Operation 63: COPY_STATUS - Poll for status of a
server-side copy . . . . . . . . . . . . . . . . . . . . . 69
10.6. Modification to Operation 42: EXCHANGE_ID -
Instantiate Client ID . . . . . . . . . . . . . . . . . . 70
10.7. Operation 64: INITIALIZE . . . . . . . . . . . . . . . . . 71
10.8. Changes to Operation 51: LAYOUTRETURN . . . . . . . . . . 74
10.8.1. Introduction . . . . . . . . . . . . . . . . . . . . . 75
10.8.2. ARGUMENT . . . . . . . . . . . . . . . . . . . . . . . 75
10.8.3. RESULT . . . . . . . . . . . . . . . . . . . . . . . . 76
10.8.4. DESCRIPTION . . . . . . . . . . . . . . . . . . . . . 76
10.8.5. IMPLEMENTATION . . . . . . . . . . . . . . . . . . . . 76
10.9. Operation 65: READ_PLUS . . . . . . . . . . . . . . . . . 78
11. NFSv4.2 Callback Operations . . . . . . . . . . . . . . . . . 80
11.1. Procedure 16: CB_ATTR_CHANGED - Notify Client that the
File's Attributes Changed . . . . . . . . . . . . . . . . 80
11.2. Operation 15: CB_COPY - Report results of a 6. Application Data Block Support . . . . . . . . . . . . . . . . 39
server-side copy . . . . . . . . . . . . . . . . . . . . . 80 6.1. Generic Framework . . . . . . . . . . . . . . . . . . . . 40
12. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 82 6.1.1. Data Block Representation . . . . . . . . . . . . . . 40
13. References . . . . . . . . . . . . . . . . . . . . . . . . . . 82 6.1.2. Data Content . . . . . . . . . . . . . . . . . . . . . 41
13.1. Normative References . . . . . . . . . . . . . . . . . . . 82 6.2. pNFS Considerations . . . . . . . . . . . . . . . . . . . 41
13.2. Informative References . . . . . . . . . . . . . . . . . . 83 6.3. An Example of Detecting Corruption . . . . . . . . . . . 42
Appendix A. Acknowledgments . . . . . . . . . . . . . . . . . . . 84 6.4. Example of READ_PLUS . . . . . . . . . . . . . . . . . . 43
Appendix B. RFC Editor Notes . . . . . . . . . . . . . . . . . . 85 6.5. Zero Filled Holes . . . . . . . . . . . . . . . . . . . . 44
Author's Address . . . . . . . . . . . . . . . . . . . . . . . . . 85 7. Labeled NFS . . . . . . . . . . . . . . . . . . . . . . . . . 44
7.1. Introduction . . . . . . . . . . . . . . . . . . . . . . 44
7.2. Definitions . . . . . . . . . . . . . . . . . . . . . . . 45
7.3. MAC Security Attribute . . . . . . . . . . . . . . . . . 46
7.3.1. Interpreting FATTR4_SEC_LABEL . . . . . . . . . . . . 46
7.3.2. Delegations . . . . . . . . . . . . . . . . . . . . . 47
7.3.3. Permission Checking . . . . . . . . . . . . . . . . . 47
7.3.4. Object Creation . . . . . . . . . . . . . . . . . . . 48
7.3.5. Existing Objects . . . . . . . . . . . . . . . . . . . 48
7.3.6. Label Changes . . . . . . . . . . . . . . . . . . . . 48
7.4. pNFS Considerations . . . . . . . . . . . . . . . . . . . 49
7.5. Discovery of Server LNFS Support . . . . . . . . . . . . 49
7.6. MAC Security NFS Modes of Operation . . . . . . . . . . . 50
7.6.1. Full Mode . . . . . . . . . . . . . . . . . . . . . . 50
7.6.2. Smart Client Mode . . . . . . . . . . . . . . . . . . 51
7.6.3. Smart Server Mode . . . . . . . . . . . . . . . . . . 52
7.7. Security Considerations . . . . . . . . . . . . . . . . . 53
8. Sharing change attribute implementation details with NFSv4
clients . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
8.1. Introduction . . . . . . . . . . . . . . . . . . . . . . 53
8.2. Definition of the 'change_attr_type' per-file system
attribute . . . . . . . . . . . . . . . . . . . . . . . . 54
9. Security Considerations . . . . . . . . . . . . . . . . . . . 55
10. Operations: REQUIRED, RECOMMENDED, or OPTIONAL . . . . . . . . 55
11. NFSv4.2 Operations . . . . . . . . . . . . . . . . . . . . . . 59
11.1. Operation 59: COPY - Initiate a server-side copy . . . . 59
11.2. Operation 60: COPY_ABORT - Cancel a server-side copy . . 66
11.3. Operation 61: COPY_NOTIFY - Notify a source server of
a future copy . . . . . . . . . . . . . . . . . . . . . . 67
11.4. Operation 62: COPY_REVOKE - Revoke a destination
server's copy privileges . . . . . . . . . . . . . . . . 70
11.5. Operation 63: COPY_STATUS - Poll for status of a
server-side copy . . . . . . . . . . . . . . . . . . . . 71
11.6. Modification to Operation 42: EXCHANGE_ID -
Instantiate Client ID . . . . . . . . . . . . . . . . . . 72
11.7. Operation 64: INITIALIZE . . . . . . . . . . . . . . . . 73
11.8. Operation 67: IO_ADVISE - Application I/O access
pattern hints . . . . . . . . . . . . . . . . . . . . . . 76
11.9. Changes to Operation 51: LAYOUTRETURN . . . . . . . . . . 83
11.9.1. Introduction . . . . . . . . . . . . . . . . . . . . . 83
11.9.2. ARGUMENT . . . . . . . . . . . . . . . . . . . . . . . 84
11.9.3. RESULT . . . . . . . . . . . . . . . . . . . . . . . . 84
11.9.4. DESCRIPTION . . . . . . . . . . . . . . . . . . . . . 84
11.9.5. IMPLEMENTATION . . . . . . . . . . . . . . . . . . . . 85
11.10. Operation 65: READ_PLUS . . . . . . . . . . . . . . . . . 86
11.11. Operation 66: SEEK . . . . . . . . . . . . . . . . . . . 88
12. NFSv4.2 Callback Operations . . . . . . . . . . . . . . . . . 89
12.1. Procedure 16: CB_ATTR_CHANGED - Notify Client that
the File's Attributes Changed . . . . . . . . . . . . . . 89
12.2. Operation 15: CB_COPY - Report results of a
server-side copy . . . . . . . . . . . . . . . . . . . . 90
13. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 91
14. References . . . . . . . . . . . . . . . . . . . . . . . . . . 91
14.1. Normative References . . . . . . . . . . . . . . . . . . 91
14.2. Informative References . . . . . . . . . . . . . . . . . 92
Appendix A. Acknowledgments . . . . . . . . . . . . . . . . . . . 94
Appendix B. RFC Editor Notes . . . . . . . . . . . . . . . . . . 95
Author's Address . . . . . . . . . . . . . . . . . . . . . . . . . 95
1. Introduction 1. Introduction
1.1. The NFS Version 4 Minor Version 2 Protocol 1.1. The NFS Version 4 Minor Version 2 Protocol
The NFS version 4 minor version 2 (NFSv4.2) protocol is the third The NFS version 4 minor version 2 (NFSv4.2) protocol is the third
minor version of the NFS version 4 (NFSv4) protocol. The first minor minor version of the NFS version 4 (NFSv4) protocol. The first minor
version, NFSv4.0, is described in [10] and the second minor version, version, NFSv4.0, is described in [11] and the second minor version,
NFSv4.1, is described in [2]. It follows the guidelines for minor NFSv4.1, is described in [2]. It follows the guidelines for minor
versioning that are listed in Section 11 of [10]. versioning that are listed in Section 11 of [11].
As a minor version, NFSv4.2 is consistent with the overall goals for As a minor version, NFSv4.2 is consistent with the overall goals for
NFSv4, but extends the protocol so as to better meet those goals, NFSv4, but extends the protocol so as to better meet those goals,
based on experiences with NFSv4.1. In addition, NFSv4.2 has adopted based on experiences with NFSv4.1. In addition, NFSv4.2 has adopted
some additional goals, which motivate some of the major extensions in some additional goals, which motivate some of the major extensions in
NFSv4.2. NFSv4.2.
1.2. Scope of This Document 1.2. Scope of This Document
This document describes the NFSv4.2 protocol. With respect to This document describes the NFSv4.2 protocol. With respect to
skipping to change at page 6, line 45 skipping to change at page 6, line 45
The full XDR for NFSv4.2 is presented in [3]. The full XDR for NFSv4.2 is presented in [3].
1.3. NFSv4.2 Goals 1.3. NFSv4.2 Goals
[[Comment.1: This needs fleshing out! --TH]] [[Comment.1: This needs fleshing out! --TH]]
1.4. Overview of NFSv4.2 Features 1.4. Overview of NFSv4.2 Features
[[Comment.2: This needs fleshing out! --TH]] [[Comment.2: This needs fleshing out! --TH]]
1.4.1. Application I/O Advise
We propose a new IO_ADVISE operation for NFSv4.2 that clients can use
to communicate expected I/O behavior to the server. By communicating
future I/O behavior such as whether a file will be accessed
sequentially or randomly, and whether a file will or will not be
accessed in the near future, servers can optimize future I/O requests
for a file by, for example, prefetching or evicting data. This
operation can be used to support the posix_fadvise function as well
as other applications such as databases and video editors.
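The kind of hint that IO_ADVISE is meant to carry can be illustrated with the local posix_fadvise interface. The following is a minimal sketch, not part of the draft: the helper name advise_access and its pattern strings are assumptions made for illustration, and the call only advises the local operating system; it does not send an NFSv4.2 operation.

```python
import os
import tempfile

# Hypothetical mapping from a readable pattern name to the POSIX advice constant.
_ADVICE_NAMES = {
    "sequential": "POSIX_FADV_SEQUENTIAL",  # file will be read in order
    "random": "POSIX_FADV_RANDOM",          # file will be read in random order
    "willneed": "POSIX_FADV_WILLNEED",      # data will be accessed in the near future
    "dontneed": "POSIX_FADV_DONTNEED",      # data will not be accessed in the near future
}


def advise_access(fd, pattern, offset=0, length=0):
    """Pass an access-pattern hint for fd to the local OS.

    Returns True if the hint was issued, False if the platform does not
    provide posix_fadvise or rejects the hint. A length of 0 means
    "to the end of the file".
    """
    if pattern not in _ADVICE_NAMES:
        raise ValueError("unknown access pattern: %r" % (pattern,))
    if not hasattr(os, "posix_fadvise"):
        return False
    try:
        os.posix_fadvise(fd, offset, length, getattr(os, _ADVICE_NAMES[pattern]))
    except OSError:
        # Hints are advisory, so a refusal is reported rather than raised.
        return False
    return True


if __name__ == "__main__":
    with tempfile.TemporaryFile() as f:
        f.write(b"x" * 4096)
        f.flush()
        print(advise_access(f.fileno(), "sequential"))
```

A client that translated such calls into IO_ADVISE would presumably forward the same pattern, offset, and length to the server, which remains free to ignore them.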
1.5. Differences from NFSv4.1 1.5. Differences from NFSv4.1
[[Comment.3: This needs fleshing out! --TH]] [[Comment.3: This needs fleshing out! --TH]]
2. NFS Server-side Copy 2. NFS Server-side Copy
2.1. Introduction 2.1. Introduction
This section describes a server-side copy feature for the NFS This section describes a server-side copy feature for the NFS
protocol. protocol.
The server-side copy feature provides a mechanism for the NFS client The server-side copy feature provides a mechanism for the NFS client
to perform a file copy on the server without the data being to perform a file copy on the server without the data being
transmitted back and forth over the network. transmitted back and forth over the network.
Without this feature, an NFS client copies data from one location to Without this feature, an NFS client copies data from one location to
skipping to change at page 8, line 6 skipping to change at page 8, line 16
server are the same server. Therefore in the context of an intra- server are the same server. Therefore in the context of an intra-
server copy, the terms source server and destination server refer to server copy, the terms source server and destination server refer to
the single server performing the copy. the single server performing the copy.
The operations described below are designed to copy files. Other The operations described below are designed to copy files. Other
file system objects can be copied by building on these operations or file system objects can be copied by building on these operations or
using other techniques. For example, if the user wishes to copy a using other techniques. For example, if the user wishes to copy a
directory, the client can synthesize a directory copy by first directory, the client can synthesize a directory copy by first
creating the destination directory and then copying the source creating the destination directory and then copying the source
directory's files to the new destination directory. If the user directory's files to the new destination directory. If the user
wishes to copy a namespace junction [11] [12], the client can use the wishes to copy a namespace junction [12] [13], the client can use the
ONC RPC Federated Filesystem protocol [12] to perform the copy. ONC RPC Federated Filesystem protocol [13] to perform the copy.
Specifically the client can determine the source junction's Specifically the client can determine the source junction's
attributes using the FEDFS_LOOKUP_FSN procedure and create a attributes using the FEDFS_LOOKUP_FSN procedure and create a
duplicate junction using the FEDFS_CREATE_JUNCTION procedure. duplicate junction using the FEDFS_CREATE_JUNCTION procedure.
For the inter-server copy protocol, the operations are defined to be For the inter-server copy protocol, the operations are defined to be
compatible with a server-to-server copy protocol in which the compatible with a server-to-server copy protocol in which the
destination server reads the file data from the source server. This destination server reads the file data from the source server. This
model in which the file data is pulled from the source by the model in which the file data is pulled from the source by the
destination has a number of advantages over a model in which the destination has a number of advantages over a model in which the
source pushes the file data to the destination. The advantages of source pushes the file data to the destination. The advantages of
skipping to change at page 14, line 28 skipping to change at page 14, line 28
of the source file to the destination file by replicating the file of the source file to the destination file by replicating the file
system formats at the block level. Another possibility is that the system formats at the block level. Another possibility is that the
source and destination might be two nodes sharing a common storage source and destination might be two nodes sharing a common storage
area network, and thus there is no need to copy any data at all, and area network, and thus there is no need to copy any data at all, and
instead ownership of the file and its contents might simply be re- instead ownership of the file and its contents might simply be re-
assigned to the destination. To allow for these possibilities, the assigned to the destination. To allow for these possibilities, the
destination server is allowed to use a server-to-server copy protocol destination server is allowed to use a server-to-server copy protocol
of its choice. of its choice.
In a heterogeneous environment, using a protocol other than NFSv4.x In a heterogeneous environment, using a protocol other than NFSv4.x
(e.g., HTTP [13] or FTP [14]) presents some challenges. In (e.g., HTTP [14] or FTP [15]) presents some challenges. In
particular, the destination server is presented with the challenge of particular, the destination server is presented with the challenge of
accessing the source file given only an NFSv4.x filehandle. accessing the source file given only an NFSv4.x filehandle.
One option for protocols that identify source files with path names One option for protocols that identify source files with path names
is to use an ASCII hexadecimal representation of the source is to use an ASCII hexadecimal representation of the source
filehandle as the file name. filehandle as the file name.
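The hexadecimal naming option can be sketched as follows. This is an illustration under stated assumptions: the function names are hypothetical, and the 128-byte limit is the NFSv4 filehandle size limit (NFS4_FHSIZE) from the base protocol rather than anything defined in this section.

```python
NFS4_FHSIZE = 128  # maximum NFSv4 filehandle size in bytes (from the base protocol)


def filehandle_to_name(filehandle: bytes) -> str:
    """Encode an opaque filehandle as an ASCII hexadecimal file name."""
    if not 0 < len(filehandle) <= NFS4_FHSIZE:
        raise ValueError("filehandle must be 1..%d bytes" % NFS4_FHSIZE)
    return filehandle.hex()


def name_to_filehandle(name: str) -> bytes:
    """Decode a hexadecimal file name back into the opaque filehandle."""
    filehandle = bytes.fromhex(name)
    if not 0 < len(filehandle) <= NFS4_FHSIZE:
        raise ValueError("decoded filehandle must be 1..%d bytes" % NFS4_FHSIZE)
    return filehandle


if __name__ == "__main__":
    fh = bytes([0x01, 0x23, 0x45])
    name = filehandle_to_name(fh)
    print(name)
    print(name_to_filehandle(name) == fh)
```

Because the encoding uses only the characters 0-9 and a-f, the resulting name should be safe to embed in a path for protocols that identify files by path name.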
Another option for the source server is to use URLs to direct the Another option for the source server is to use URLs to direct the
destination server to a specialized service. For example, the destination server to a specialized service. For example, the
response to COPY_NOTIFY could include the URL response to COPY_NOTIFY could include the URL
skipping to change at page 16, line 30 skipping to change at page 16, line 30
COPY_ABORT operation or the client replies to a CB_COPY operation. COPY_ABORT operation or the client replies to a CB_COPY operation.
A copy offload stateid's seqid MUST NOT be 0 (zero). In the context A copy offload stateid's seqid MUST NOT be 0 (zero). In the context
of a copy offload operation, it is ambiguous to indicate the most of a copy offload operation, it is ambiguous to indicate the most
recent copy offload operation using a stateid with seqid of 0 (zero). recent copy offload operation using a stateid with seqid of 0 (zero).
Therefore a copy offload stateid with seqid of 0 (zero) MUST be Therefore a copy offload stateid with seqid of 0 (zero) MUST be
considered invalid. considered invalid.
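The seqid rule above can be expressed as a simple validity check. The sketch below is illustrative only: the Stateid class and function name are assumptions, and the field layout (a 32-bit seqid followed by 12 opaque bytes) follows the stateid4 definition of the base NFSv4 protocol.

```python
from dataclasses import dataclass

NFS4_OTHER_SIZE = 12  # size of the opaque "other" field of a stateid4


@dataclass(frozen=True)
class Stateid:
    """Illustrative stand-in for the stateid4 structure."""
    seqid: int    # unsigned 32-bit sequence number
    other: bytes  # opaque, server-assigned identifier


def is_valid_copy_offload_stateid(stateid: Stateid) -> bool:
    """Reject a copy offload stateid whose seqid is 0 (zero).

    A seqid of 0 would mean "the most recent operation", which is
    ambiguous for copy offload, so it is treated as invalid.
    """
    if len(stateid.other) != NFS4_OTHER_SIZE:
        return False
    return 0 < stateid.seqid <= 0xFFFFFFFF


if __name__ == "__main__":
    print(is_valid_copy_offload_stateid(Stateid(1, bytes(12))))
    print(is_valid_copy_offload_stateid(Stateid(0, bytes(12))))
```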
2.4. Security Considerations 2.4. Security Considerations
The security considerations pertaining to NFSv4 [10] apply to this The security considerations pertaining to NFSv4 [11] apply to this
document. document.
The standard security mechanisms provided by NFSv4 [10] may be used to The standard security mechanisms provided by NFSv4 [11] may be used to
secure the protocol described in this document. secure the protocol described in this document.
NFSv4 clients and servers supporting the inter-server copy NFSv4 clients and servers supporting the inter-server copy
operations described in this document are REQUIRED to implement [5], operations described in this document are REQUIRED to implement [5],
including the RPCSEC_GSSv3 privileges copy_from_auth and including the RPCSEC_GSSv3 privileges copy_from_auth and
copy_to_auth. If the server-to-server copy protocol is ONC RPC copy_to_auth. If the server-to-server copy protocol is ONC RPC
based, the servers are also REQUIRED to implement the RPCSEC_GSSv3 based, the servers are also REQUIRED to implement the RPCSEC_GSSv3
privilege copy_confirm_auth. These requirements to implement are not privilege copy_confirm_auth. These requirements to implement are not
requirements to use. NFSv4 clients and servers are RECOMMENDED to requirements to use. NFSv4 clients and servers are RECOMMENDED to
use [5] to secure server-side copy operations. use [5] to secure server-side copy operations.
skipping to change at page 23, line 25 skipping to change at page 23, line 25
2.4.1.2.4. Securing Non ONC RPC Server-to-Server Copy Protocols 2.4.1.2.4. Securing Non ONC RPC Server-to-Server Copy Protocols
If the destination won't be using ONC RPC to copy the data, then the If the destination won't be using ONC RPC to copy the data, then the
source and destination are using an unspecified copy protocol. The source and destination are using an unspecified copy protocol. The
destination could use the shared secret and the NFSv4 user id to destination could use the shared secret and the NFSv4 user id to
prove to the source server that the user principal has authorized the prove to the source server that the user principal has authorized the
copy. copy.
For protocols that authenticate user names with passwords (e.g., HTTP For protocols that authenticate user names with passwords (e.g., HTTP
[13] and FTP [14]), the NFSv4 user id could be used as the user name, [14] and FTP [15]), the NFSv4 user id could be used as the user name,
and an ASCII hexadecimal representation of the RPCSEC_GSSv3 shared and an ASCII hexadecimal representation of the RPCSEC_GSSv3 shared
secret could be used as the user password or as input into non- secret could be used as the user password or as input into non-
password authentication methods like CHAP [15]. password authentication methods like CHAP [16].
2.4.1.3. Inter-Server Copy via ONC RPC but without RPCSEC_GSSv3 2.4.1.3. Inter-Server Copy via ONC RPC but without RPCSEC_GSSv3
ONC RPC security flavors other than RPCSEC_GSSv3 MAY be used with the ONC RPC security flavors other than RPCSEC_GSSv3 MAY be used with the
server-side copy offload operations described in this document. In server-side copy offload operations described in this document. In
particular, host-based ONC RPC security flavors such as AUTH_NONE and particular, host-based ONC RPC security flavors such as AUTH_NONE and
AUTH_SYS MAY be used. If a host-based security flavor is used, a AUTH_SYS MAY be used. If a host-based security flavor is used, a
minimal level of protection for the server-to-server copy protocol is minimal level of protection for the server-to-server copy protocol is
possible. possible.
skipping to change at page 24, line 33 skipping to change at page 24, line 33
COMPOUND { PUTROOTFH, LOOKUP "_COPY" ; LOOKUP "10.11.78.56"; LOOKUP COMPOUND { PUTROOTFH, LOOKUP "_COPY" ; LOOKUP "10.11.78.56"; LOOKUP
"_FH" ; OPEN "0x12345" ; GETFH } "_FH" ; OPEN "0x12345" ; GETFH }
The source server will therefore know that these NFSv4.1 operations The source server will therefore know that these NFSv4.1 operations
are being issued by the destination server identified in the are being issued by the destination server identified in the
COPY_NOTIFY. COPY_NOTIFY.
2.4.1.4. Inter-Server Copy without ONC RPC and RPCSEC_GSSv3 2.4.1.4. Inter-Server Copy without ONC RPC and RPCSEC_GSSv3
The same techniques as Section 2.4.1.3, using unique URLs for each The same techniques as Section 2.4.1.3, using unique URLs for each
destination server, can be used for other protocols (e.g., HTTP [13] destination server, can be used for other protocols (e.g., HTTP [14]
and FTP [14]) as well. and FTP [15]) as well.
3. Sparse Files 3. Sparse Files
3.1. Introduction 3.1. Introduction
A sparse file is a common way of representing a large file without A sparse file is a common way of representing a large file without
having to utilize all of the disk space for it. Consequently, a having to utilize all of the disk space for it. Consequently, a
sparse file uses less physical space than its size indicates. This sparse file uses less physical space than its size indicates. This
means the file contains 'holes', byte ranges within the file that means the file contains 'holes', byte ranges within the file that
contain no data. Most modern file systems support sparse files, contain no data. Most modern file systems support sparse files,
skipping to change at page 25, line 35 skipping to change at page 25, line 35
3.2. Terminology 3.2. Terminology
Regular file: An object of file type NF4REG or NF4NAMEDATTR. Regular file: An object of file type NF4REG or NF4NAMEDATTR.
Sparse file: A Regular file that contains one or more Holes. Sparse file: A Regular file that contains one or more Holes.
Hole: A byte range within a Sparse file that contains regions of all Hole: A byte range within a Sparse file that contains regions of all
zeroes. For block-based file systems, this could also be an zeroes. For block-based file systems, this could also be an
unallocated region of the file. unallocated region of the file.
Hole Threshold The minimum length of a Hole as determined by the Hole Threshold: The minimum length of a Hole as determined by the
server. If a server chooses to define a Hole Threshold, then it server. If a server chooses to define a Hole Threshold, then it
would not return hole information (nfs_readplusreshole) with a would not return hole information (nfs_readplusreshole) with a
hole_offset and hole_length that specify a range shorter than the hole_offset and hole_length that specify a range shorter than the
Hole Threshold. Hole Threshold.
3.3. Overview of Sparse Files and NFSv4 3.3. Overview of Sparse Files and NFSv4
This section provides sparse file support to the largest number of This section provides sparse file support to the largest number of
NFS client and server implementations, and as such proposes to add a NFS client and server implementations, and as such proposes to add a
new return code to the READ_PLUS operation instead of proposing new return code to the READ_PLUS operation instead of proposing
skipping to change at page 27, line 13 skipping to change at page 27, line 13
}; };
3.4.2. RESULT 3.4.2. RESULT
union read_plus_content switch (data_content4 content) { union read_plus_content switch (data_content4 content) {
case NFS4_CONTENT_DATA: case NFS4_CONTENT_DATA:
opaque rpc_data<>; opaque rpc_data<>;
case NFS4_CONTENT_APP_BLOCK: case NFS4_CONTENT_APP_BLOCK:
app_data_block4 rpc_block; app_data_block4 rpc_block;
case NFS4_CONTENT_HOLE: case NFS4_CONTENT_HOLE:
hole_info4 rpc_hole; data_info4 rpc_hole;
default: default:
void; void;
}; };
/* /*
* Allow a return of an array of contents. * Allow a return of an array of contents.
*/ */
struct read_plus_res4 { struct read_plus_res4 {
bool rpr_eof; bool rpr_eof;
read_plus_content rpr_contents<>; read_plus_content rpr_contents<>;
skipping to change at page 34, line 32 skipping to change at page 34, line 32
The following operations and attributes can be used to resolve these The following operations and attributes can be used to resolve these
issues: issues:
space_reserved This attribute specifies whether the blocks backing space_reserved This attribute specifies whether the blocks backing
the file have been preallocated. the file have been preallocated.
space_freed This attribute specifies the space freed when a file is space_freed This attribute specifies the space freed when a file is
deleted, taking block sharing into consideration. deleted, taking block sharing into consideration.
max_hole_punch This attribute specifies the maximum sized hole that
can be punched on the filesystem.
INITIALIZE This operation zeroes and/or deallocates the blocks INITIALIZE This operation zeroes and/or deallocates the blocks
backing a region of the file. backing a region of the file.
If space_used of a file is interpreted to mean the size in bytes of If space_used of a file is interpreted to mean the size in bytes of
all disk blocks pointed to by the inode of the file, then shared all disk blocks pointed to by the inode of the file, then shared
blocks get double counted, over-reporting the space utilization. blocks get double counted, over-reporting the space utilization.
This also has the adverse effect that the deletion of a file with This also has the adverse effect that the deletion of a file with
shared blocks frees up less than space_used bytes. shared blocks frees up less than space_used bytes.
On the other hand, if space_used is interpreted to mean the size in On the other hand, if space_used is interpreted to mean the size in
skipping to change at page 36, line 15 skipping to change at page 36, line 11
guaranteed for the new size. If the file size is decreased, space guaranteed for the new size. If the file size is decreased, space
reservation is only guaranteed for the new size and the extra blocks reservation is only guaranteed for the new size and the extra blocks
backing the file can be released. backing the file can be released.
4.4. Attribute 78: space_freed 4.4. Attribute 78: space_freed
space_freed gives the number of bytes freed if the file is deleted. space_freed gives the number of bytes freed if the file is deleted.
This attribute is read only and is of type length4. It is a per file This attribute is read only and is of type length4. It is a per file
attribute. attribute.
4.5. Attribute 79: max_hole_punch 5. Support for Application IO Hints
max_hole_punch specifies the maximum size of a hole that the 5.1. Introduction
INITIALIZE operation can handle. This attribute is read only and of
type length4. It is a per filesystem attribute. This attribute MUST
be implemented if INITIALIZE is implemented. [[Comment.4:
max_hole_punch when doing ADB initialization? --TH]]
5. Application Data Block Support Applications currently have several options for communicating I/O
access patterns to the NFS client. While this can help the NFS
client optimize I/O and caching for a file, it does not allow the NFS
server and its exported file system to do likewise. Therefore, here
we put forth a proposal for the NFSv4.2 protocol to allow
applications to communicate their expected behavior to the server.
By communicating the expected access pattern, e.g., sequential or random,
and data re-use behavior, e.g., data range will be read multiple
times and should be cached, the server will be able to better
understand what optimizations it should implement for access to a
file. For example, if an application indicates it will never read the
data more than once, then the file system can avoid polluting the
data cache and not cache the data.
One source of client I/O hints is the posix_fadvise operation. For
example, on Linux, when an application uses posix_fadvise to specify
that a file will be read sequentially, Linux doubles the readahead
buffer size.
Another instance where applications provide an indication of their
desired I/O behavior is the use of direct I/O. By specifying direct
I/O, clients will no longer cache data, but this information is not
passed to the server, which will continue caching data.
Application specific NFS clients such as those used by hypervisors
and databases can also leverage application hints to communicate
their specialized requirements.
This section adds a new IO_ADVISE operation to communicate client
file access patterns to the NFS server. The NFS server upon
receiving an IO_ADVISE operation MAY choose to alter its I/O and
caching behavior, but is under no obligation to do so.
5.2. POSIX Requirements
The first key requirement of the IO_ADVISE operation is to support
the posix_fadvise function [6], which is supported in Linux and many
other operating systems. Examples and guidance on how to use
posix_fadvise to improve performance can be found in [17].
posix_fadvise is defined as follows:
int posix_fadvise(int fd, off_t offset, off_t len, int advice);
The posix_fadvise() function shall advise the implementation on the
expected behavior of the application with respect to the data in the
file associated with the open file descriptor, fd, starting at offset
and continuing for len bytes. The specified range need not currently
exist in the file. If len is zero, all data following offset is
specified. The implementation may use this information to optimize
handling of the specified data. The posix_fadvise() function shall
have no effect on the semantics of other operations on the specified
data, although it may affect the performance of other operations.
The advice to be applied to the data is specified by the advice
parameter and may be one of the following values:
POSIX_FADV_NORMAL - Specifies that the application has no advice to
give on its behavior with respect to the specified data. It is
the default characteristic if no advice is given for an open file.
POSIX_FADV_SEQUENTIAL - Specifies that the application expects to
access the specified data sequentially from lower offsets to
higher offsets.
POSIX_FADV_RANDOM - Specifies that the application expects to access
the specified data in a random order.
POSIX_FADV_WILLNEED - Specifies that the application expects to
access the specified data in the near future.
POSIX_FADV_DONTNEED - Specifies that the application expects that it
will not access the specified data in the near future.
POSIX_FADV_NOREUSE - Specifies that the application expects to
access the specified data once and then not reuse it thereafter.
Upon successful completion, posix_fadvise() shall return zero;
otherwise, an error number shall be returned to indicate the error.
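As an illustration of the interface described above, the following is a minimal sketch in Python, whose os.posix_fadvise wrapper takes the same four arguments as the C function. The helper name advise_sequential and the temporary file are assumptions made for this example only, and the call is available only on platforms that provide posix_fadvise.

```python
import os
import tempfile


def advise_sequential(path, offset=0, length=0):
    """Hint that `path` will be read sequentially (hypothetical helper).

    A length of 0 means "all data following offset", as in posix_fadvise.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        # Raises OSError on failure; the advice never changes read semantics.
        os.posix_fadvise(fd, offset, length, os.POSIX_FADV_SEQUENTIAL)
        return os.read(fd, 16)
    finally:
        os.close(fd)


# Demonstration on a throwaway file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"example data for the advice call")
    demo_path = tmp.name

first_bytes = advise_sequential(demo_path)
os.unlink(demo_path)
```

The data returned by the read is the same with or without the advice; only the caching and readahead behavior of the implementation may differ.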
5.3. Additional Requirements
Many use cases exist for sending application I/O hints to the server
that cannot utilize the POSIX supported interface. This is because
some applications may benefit from additional hints not specified by
posix_fadvise, and some applications may not use POSIX at all.
One use case is "Opportunistic Prefetch", which allows a stateid
holder to tell the server that it is possible that it will access the
specified data in the near future. This is similar to
POSIX_FADV_WILLNEED, but the client is unsure it will in fact read
the specified data, so the server should only prefetch the data if it
can be done at a marginal cost. For example, when a server receives
this hint, it could prefetch only the indirect blocks for a file
instead of all the data. This would still improve performance if the
client does read the data, but with less pressure on server memory.
An example use case for this hint is a database that reads in a
single record that points to additional records in either other areas
of the same file or different files located on the same or a different
server. While it is likely that the application may access the
additional records, it is far from guaranteed. Therefore, the
database may issue an opportunistic prefetch (instead of
POSIX_FADV_WILLNEED) for the data in the other files pointed to by
the record.
Another use case is "Direct I/O", which allows a stateid holder to
inform the server that it does not wish to cache data. Today, for
applications that only intend to read data once, the use of direct
I/O disables client caching, but does not affect server caching. By
caching data that will not be re-read, the server is polluting its
cache and possibly causing useful cached data to be evicted. By
informing the server of its expected I/O access, this situation can
be avoided. Direct I/O can be used in Linux and AIX via the open()
O_DIRECT parameter, in Solaris via the directio() function, and in
Windows via the CreateFile() FILE_FLAG_NO_BUFFERING flag.
Another use case is "Backward Sequential Read", which allows a stateid
holder to inform the server that it intends to read the specified
data backwards, i.e., from the end to the beginning. This is
different from POSIX_FADV_SEQUENTIAL, whose implied intention is
that data will be read from beginning to end. This hint allows
servers to prefetch data at the end of the range first, and then
prefetch data sequentially in a backwards manner to the start of the
data range. One example of an application that can make use of this
hint is video editing.
5.4. Security Considerations
None.
5.5. IANA Considerations
The IO_ADVISE_type4 will be extended through an IANA registry.
6. Application Data Block Support
At the OS level, files are contained on disk blocks. Applications At the OS level, files are contained on disk blocks. Applications
are also free to impose structure on the data contained in a file and are also free to impose structure on the data contained in a file and
we can define an Application Data Block (ADB) to be such a structure. we can define an Application Data Block (ADB) to be such a structure.
From the application's viewpoint, it only wants to handle ADBs and From the application's viewpoint, it only wants to handle ADBs and
not raw bytes (see [16]). An ADB is typically comprised of two not raw bytes (see [18]). An ADB is typically comprised of two
sections: a header and data. The header describes the sections: a header and data. The header describes the
characteristics of the block and can provide a means to detect characteristics of the block and can provide a means to detect
corruption in the data payload. The data section is typically corruption in the data payload. The data section is typically
initialized to all zeros. initialized to all zeros.
The format of the header is application specific, but there are two The format of the header is application specific, but there are two
main components typically encountered: main components typically encountered:
1. An ADB Number (ADBN), which allows the application to determine 1. An ADB Number (ADBN), which allows the application to determine
which data block is being referenced. The ADBN is a logical which data block is being referenced. The ADBN is a logical
block number and is useful when the client is not storing the block number and is useful when the client is not storing the
blocks in contiguous memory. blocks in contiguous memory.
2. Fields to describe the state of the ADB and a means to detect 2. Fields to describe the state of the ADB and a means to detect
block corruption. For both pieces of data, a useful property is block corruption. For both pieces of data, a useful property is
that allowed values be unique in that if passed across the that allowed values be unique in that if passed across the
network, corruption due to translation between big and little network, corruption due to translation between big and little
endian architectures is detectable. For example, 0xF0DEDEF0 has endian architectures is detectable. For example, 0xF0DEDEF0 has
the same bit pattern in both architectures. the same bit pattern in both architectures.
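The byte-order property claimed for 0xF0DEDEF0 can be checked with a short sketch. The function name is_byte_order_invariant is an assumption introduced for illustration and is not part of the protocol.

```python
import struct


def is_byte_order_invariant(value, width=4):
    """Return True if `value` has the same byte sequence in both byte orders."""
    raw = value.to_bytes(width, "big")
    return raw == raw[::-1]


GUARD = 0xF0DEDEF0
big = struct.pack(">I", GUARD)     # big-endian encoding
little = struct.pack("<I", GUARD)  # little-endian encoding
same_on_both = big == little
```

A value such as 0xFEEDFACE does not have this property, so a byte-swapped copy of it would read as a different number.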
Applications already impose structures on files [16] and detect Applications already impose structures on files [18] and detect
corruption in data blocks [17]. What they are not able to do is corruption in data blocks [19]. What they are not able to do is
efficiently transfer and store ADBs. To initialize a file with ADBs, efficiently transfer and store ADBs. To initialize a file with ADBs,
the client must send the full ADB to the server and that must be the client must send the full ADB to the server and that must be
stored on the server. When the application is initializing a file to stored on the server. When the application is initializing a file to
have the ADB structure, it could compress the ADBs to just the have the ADB structure, it could compress the ADBs to just the
information necessary to later reconstruct the header portion of information necessary to later reconstruct the header portion of
the ADB when the contents are read back. Using sparse file the ADB when the contents are read back. Using sparse file
techniques, the disk blocks described by would not be allocated. techniques, the disk blocks described by would not be allocated.
Unlike sparse file techniques, there would be a small cost to store Unlike sparse file techniques, there would be a small cost to store
the compressed header data. the compressed header data.
In this section, we are going to define a generic framework for an In this section, we are going to define a generic framework for an
ADB, present one approach to detecting corruption in a given ADB ADB, present one approach to detecting corruption in a given ADB
implementation, and describe the model for how the client and server implementation, and describe the model for how the client and server
can support efficient initialization of ADBs, reading of ADB holes, can support efficient initialization of ADBs, reading of ADB holes,
punching holes in ADBs, and space reservation. Further, we need to punching holes in ADBs, and space reservation. Further, we need to
be able to extend this model to applications which do not support be able to extend this model to applications which do not support
ADBs, but wish to be able to handle sparse files, hole punching, and ADBs, but wish to be able to handle sparse files, hole punching, and
space reservation. space reservation.
5.1. Generic Framework 6.1. Generic Framework
We want the representation of the ADB to be flexible enough to We want the representation of the ADB to be flexible enough to
support many different applications. The most basic approach is no support many different applications. The most basic approach is no
imposition of a block at all, which means we are working with the raw imposition of a block at all, which means we are working with the raw
bytes. Such an approach would be useful for storing holes, punching bytes. Such an approach would be useful for storing holes, punching
holes, etc. In more complex deployments, a server might be holes, etc. In more complex deployments, a server might be
supporting multiple applications, each with their own definition of supporting multiple applications, each with their own definition of
the ADB. One might store the ADBN at the start of the block and then the ADB. One might store the ADBN at the start of the block and then
have a guard pattern to detect corruption [18]. The next might store have a guard pattern to detect corruption [20]. The next might store
the ADBN at an offset of 100 bytes within the block and have no guard the ADBN at an offset of 100 bytes within the block and have no guard
pattern at all. The point is that existing applications might pattern at all. The point is that existing applications might
already have well defined formats for their data blocks. already have well defined formats for their data blocks.
The guard pattern can be used to represent the state of the block, to The guard pattern can be used to represent the state of the block, to
protect against corruption, or both. Again, it needs to be able to protect against corruption, or both. Again, it needs to be able to
be placed anywhere within the ADB. be placed anywhere within the ADB.
We need to be able to represent the starting offset of the block and We need to be able to represent the starting offset of the block and
the size of the block. Note that nothing prevents the application the size of the block. Note that nothing prevents the application
from defining different sized blocks in a file. from defining different sized blocks in a file.
5.1.1. Data Block Representation 6.1.1. Data Block Representation
struct app_data_block4 { struct app_data_block4 {
offset4 adb_offset; offset4 adb_offset;
length4 adb_block_size; length4 adb_block_size;
length4 adb_block_count; length4 adb_block_count;
length4 adb_reloff_blocknum; length4 adb_reloff_blocknum;
count4 adb_block_num; count4 adb_block_num;
length4 adb_reloff_pattern; length4 adb_reloff_pattern;
opaque adb_pattern<>; opaque adb_pattern<>;
}; };
skipping to change at page 38, line 27 skipping to change at page 41, line 9
The app_data_block4 structure captures the abstraction presented for The app_data_block4 structure captures the abstraction presented for
the ADB. The additional fields present are to allow the transmission the ADB. The additional fields present are to allow the transmission
of adb_block_count ADBs at one time. We also use adb_block_num to of adb_block_count ADBs at one time. We also use adb_block_num to
convey the ADBN of the first block in the sequence. Each ADB will convey the ADBN of the first block in the sequence. Each ADB will
contain the same adb_pattern string. contain the same adb_pattern string.
As both adb_block_num and adb_pattern are optional, if either As both adb_block_num and adb_pattern are optional, if either
adb_reloff_pattern or adb_reloff_blocknum is set to NFS4_UINT64_MAX, adb_reloff_pattern or adb_reloff_blocknum is set to NFS4_UINT64_MAX,
then the corresponding field is not set in any of the ADBs. then the corresponding field is not set in any of the ADBs.
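To make the field semantics concrete, here is a hedged sketch of how one app_data_block4 could be expanded into the blocks it describes. It is not a defined encoding. The class AppDataBlock4, the function expand, the 4-byte big-endian ADBN, the zero-filled remainder, and the assumption that the ADBN increases by one per block are all choices made for this example.

```python
from dataclasses import dataclass

NFS4_UINT64_MAX = 0xFFFFFFFFFFFFFFFF  # marks an optional field as not set


@dataclass
class AppDataBlock4:
    adb_offset: int
    adb_block_size: int
    adb_block_count: int
    adb_reloff_blocknum: int
    adb_block_num: int
    adb_reloff_pattern: int
    adb_pattern: bytes


def expand(adb, blocknum_width=4):
    """Materialize the blocks described by one app_data_block4 (illustrative)."""
    blocks = []
    for i in range(adb.adb_block_count):
        block = bytearray(adb.adb_block_size)  # data section starts as zeros
        if adb.adb_reloff_blocknum != NFS4_UINT64_MAX:
            start = adb.adb_reloff_blocknum
            number = adb.adb_block_num + i  # assumed: sequential ADBNs
            block[start:start + blocknum_width] = number.to_bytes(
                blocknum_width, "big")
        if adb.adb_reloff_pattern != NFS4_UINT64_MAX:
            start = adb.adb_reloff_pattern
            block[start:start + len(adb.adb_pattern)] = adb.adb_pattern
        blocks.append(bytes(block))
    return blocks


# 100 blocks of 4k, ADBN at byte 0 starting from 0, pattern at byte 8.
free_blocks = expand(
    AppDataBlock4(0, 4096, 100, 0, 0, 8, bytes.fromhex("feedface")))

# Neither optional field set: a single 64k block of zeros.
zero_blocks = expand(
    AppDataBlock4(100 * 2**20, 65536, 1, NFS4_UINT64_MAX, 0,
                  NFS4_UINT64_MAX, b"\x00"))
```

The point of the compact form is that the server need not store the expanded bytes at all; the expansion here only shows what a reader of the file would observe.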
5.1.2. Data Content 6.1.2. Data Content
/* /*
* Use an enum such that we can extend new types. * Use an enum such that we can extend new types.
*/ */
enum data_content4 { enum data_content4 {
NFS4_CONTENT_DATA = 0, NFS4_CONTENT_DATA = 0,
NFS4_CONTENT_APP_BLOCK = 1, NFS4_CONTENT_APP_BLOCK = 1,
NFS4_CONTENT_HOLE = 2 NFS4_CONTENT_HOLE = 2
}; };
New operations might need to differentiate between wanting to access New operations might need to differentiate between wanting to access
data versus an ADB. Also, future minor versions might want to data versus an ADB. Also, future minor versions might want to
introduce new data formats. This enumeration allows that to occur. introduce new data formats. This enumeration allows that to occur.
5.2. pNFS Considerations 6.2. pNFS Considerations
While this document does not mandate how sparse ADBs are recorded on While this document does not mandate how sparse ADBs are recorded on
the server, it does make the assumption that such information is not the server, it does make the assumption that such information is not
in the file. I.e., the information is metadata. As such, the in the file. I.e., the information is metadata. As such, the
INITIALIZE operation is defined to be not supported by the DS - it INITIALIZE operation is defined to be not supported by the DS - it
must be issued to the MDS. But since the client must not assume a must be issued to the MDS. But since the client must not assume a
priori whether a read is sparse or not, the READ_PLUS operation MUST priori whether a read is sparse or not, the READ_PLUS operation MUST
be supported by both the DS and the MDS. I.e., the client might be supported by both the DS and the MDS. I.e., the client might
impose on the MDS to asynchronously read the data from the DS. impose on the MDS to asynchronously read the data from the DS.
Furthermore, each DS MUST NOT report to a client either a sparse ADB Furthermore, each DS MUST NOT report to a client either a sparse ADB
or data which belongs to another DS. One implication of this or data which belongs to another DS. One implication of this
requirement is that the app_data_block4's adb_block_size MUST requirement is that the app_data_block4's adb_block_size MUST
either be the stripe width or the stripe width must be an even either be the stripe width or the stripe width must be an even
multiple of it. multiple of it.
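The block size constraint can be expressed as a simple divisibility test. This sketch reads "even multiple" as "exact integer multiple", which is an interpretation rather than something the text states; the function name block_size_fits_stripe is hypothetical.

```python
def block_size_fits_stripe(adb_block_size, stripe_width):
    """True if the stripe width is an exact multiple of the ADB block size.

    Equality is the multiple-of-one case, so it is covered as well.
    """
    if adb_block_size <= 0 or stripe_width <= 0:
        return False
    return stripe_width % adb_block_size == 0


fits_4k_in_64k = block_size_fits_stripe(4096, 65536)
fits_equal = block_size_fits_stripe(65536, 65536)
fits_unaligned = block_size_fits_stripe(4096, 6000)
```

When the test fails, some ADB would straddle a stripe boundary and two data servers would each hold part of it.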
The second implication here is that the DS must be able to use the The second implication here is that the DS must be able to use the
Control Protocol to determine from the MDS where the sparse ADBs Control Protocol to determine from the MDS where the sparse ADBs
occur. [[Comment.5: Need to discuss what happens if after the file occur. [[Comment.4: Need to discuss what happens if after the file
is being written to and an INITIALIZE occurs? --TH]] Perhaps instead is being written to and an INITIALIZE occurs? --TH]] Perhaps instead
of the DS pulling from the MDS, the MDS pushes to the DS? Thus an of the DS pulling from the MDS, the MDS pushes to the DS? Thus an
INITIALIZE causes a new push? [[Comment.6: Still need to consider INITIALIZE causes a new push? [[Comment.5: Still need to consider
race cases of the DS getting a WRITE and the MDS getting an race cases of the DS getting a WRITE and the MDS getting an
INITIALIZE. --TH]] INITIALIZE. --TH]]
5.3. An Example of Detecting Corruption 6.3. An Example of Detecting Corruption
In this section, we define an ADB format in which corruption can be In this section, we define an ADB format in which corruption can be
detected. Note that this is just one possible format and means to detected. Note that this is just one possible format and means to
detect corruption. detect corruption.
Consider a very basic implementation of an operating system's disk Consider a very basic implementation of an operating system's disk
blocks. A block is either data or it is an indirect block which blocks. A block is either data or it is an indirect block which
allows for files to be larger than one block. It is desired to be allows for files to be larger than one block. It is desired to be
able to initialize a block. Lastly, to quickly unlink a file, a able to initialize a block. Lastly, to quickly unlink a file, a
block can be marked invalid. The contents remain intact - which block can be marked invalid. The contents remain intact - which
skipping to change at page 39, line 51 skipping to change at page 42, line 36
0xcafedead - This is the DATA state and indicates that real data 0xcafedead - This is the DATA state and indicates that real data
has been written to this block. has been written to this block.
0xe4e5c001 - This is the INDIRECT state and indicates that the 0xe4e5c001 - This is the INDIRECT state and indicates that the
block contains block counter numbers that are chained off of this block contains block counter numbers that are chained off of this
block. block.
0xba1ed4a3 - This is the INVALID state and indicates that the block 0xba1ed4a3 - This is the INVALID state and indicates that the block
contains data whose contents are garbage. contains data whose contents are garbage.
Finally, it also defines an 8 byte checksum [19] starting at byte 16 Finally, it also defines an 8 byte checksum [21] starting at byte 16
which applies to the remaining contents of the block. If the state which applies to the remaining contents of the block. If the state
is FREE, then that checksum is trivially zero. As such, the is FREE, then that checksum is trivially zero. As such, the
application has no need to transfer the checksum implicitly inside application has no need to transfer the checksum implicitly inside
the ADB - it need not make the transfer layer aware of the fact that the ADB - it need not make the transfer layer aware of the fact that
there is a checksum (see [17] for an example of checksums used to there is a checksum (see [19] for an example of checksums used to
detect corruption in application data blocks). detect corruption in application data blocks).
Corruption in each ADB can be detected thusly: Corruption in each ADB can be detected thusly:
o If the guard pattern is anything other than one of the allowed o If the guard pattern is anything other than one of the allowed
values, including all zeros. values, including all zeros.
o If the guard pattern is FREE and any other byte in the remainder o If the guard pattern is FREE and any other byte in the remainder
of the ADB is anything other than zero. of the ADB is anything other than zero.
skipping to change at page 40, line 43 skipping to change at page 43, line 30
minimum amount of data we incorporated into our generic framework. minimum amount of data we incorporated into our generic framework.
I.e., the guard pattern is sufficient in allowing applications to I.e., the guard pattern is sufficient in allowing applications to
design their own corruption detection. design their own corruption detection.
Finally, it is important to note that none of these corruption checks Finally, it is important to note that none of these corruption checks
occur in the transport layer. The server and client components are occur in the transport layer. The server and client components are
totally unaware of the file format and might report everything as totally unaware of the file format and might report everything as
being transferred correctly even in the case the application detects being transferred correctly even in the case the application detects
corruption. corruption.
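An application-level check along the lines of the rules listed in this section might look like the following sketch. Only the two rules about the guard pattern are implemented. The guard offset of 8 bytes, the 4-byte big-endian guard, and the FREE value 0xfeedface are assumptions taken from the INITIALIZE example in the READ_PLUS section; the name check_adb is hypothetical.

```python
GUARD_OFFSET = 8  # assumed position of the guard pattern within the ADB
GUARD_WIDTH = 4   # assumed width of the guard pattern in bytes

GUARD_STATES = {
    0xFEEDFACE: "FREE",      # assumed from the INITIALIZE example
    0xCAFEDEAD: "DATA",
    0xE4E5C001: "INDIRECT",
    0xBA1ED4A3: "INVALID",
}


def check_adb(block):
    """Return the block state, or a string starting with 'corrupt'."""
    end = GUARD_OFFSET + GUARD_WIDTH
    guard = int.from_bytes(block[GUARD_OFFSET:end], "big")
    state = GUARD_STATES.get(guard)
    if state is None:
        # Covers all-zero guards as well as any other unknown value.
        return "corrupt: unknown guard pattern"
    if state == "FREE" and any(block[end:]):
        return "corrupt: FREE block with non-zero contents"
    return state


def make_block(guard, size=4096):
    """Build a zero-filled block carrying the given guard (test helper)."""
    block = bytearray(size)
    block[GUARD_OFFSET:GUARD_OFFSET + GUARD_WIDTH] = guard.to_bytes(4, "big")
    return block


free_block = make_block(0xFEEDFACE)
dirty_free_block = make_block(0xFEEDFACE)
dirty_free_block[100] = 0x01
```

As the text notes, a check like this runs in the application; the client and server transport would report the transfer as successful either way.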
5.4. Example of READ_PLUS 6.4. Example of READ_PLUS
The hypothetical application presented in Section 5.3 can be used to The hypothetical application presented in Section 6.3 can be used to
illustrate how READ_PLUS would return an array of results. A file is illustrate how READ_PLUS would return an array of results. A file is
created and initialized with 100 4k ADBs in the FREE state: created and initialized with 100 4k ADBs in the FREE state:
INITIALIZE {0, 4k, 100, 0, 0, 8, 0xfeedface} INITIALIZE {0, 4k, 100, 0, 0, 8, 0xfeedface}
Further, assume the application writes a single ADB at 16k, changing Further, assume the application writes a single ADB at 16k, changing
the guard pattern to 0xcafedead, we would then have in memory: the guard pattern to 0xcafedead, we would then have in memory:
0 -> (16k - 1) : 4k, 4, 0, 0, 8, 0xfeedface 0 -> (16k - 1) : 4k, 4, 0, 0, 8, 0xfeedface
16k -> (20k - 1) : 00 00 00 05 ca fe de ad XX XX ... XX XX 16k -> (20k - 1) : 00 00 00 05 ca fe de ad XX XX ... XX XX
20k -> 400k : 4k, 95, 0, 6, 0xfeedface 20k -> 400k : 4k, 95, 0, 6, 0xfeedface
And when the client did a READ_PLUS of 64k at the start of the file, And when the client did a READ_PLUS of 64k at the start of the file,
it would get back a result of an ADB, some data, and a final ADB: it would get back a result of an ADB, some data, and a final ADB:
ADB {0, 4, 0, 0, 8, 0xfeedface} ADB {0, 4, 0, 0, 8, 0xfeedface}
data 4k data 4k
ADB {20k, 4k, 59, 0, 6, 0xfeedface} ADB {20k, 4k, 59, 0, 6, 0xfeedface}
5.5. Zero Filled Holes 6.5. Zero Filled Holes
As applications are free to define the structure of an ADB, it is As applications are free to define the structure of an ADB, it is
trivial to define an ADB which supports zero filled holes. Such a trivial to define an ADB which supports zero filled holes. Such a
case would encompass the traditional definitions of a sparse file and case would encompass the traditional definitions of a sparse file and
hole punching. For example, to punch a 64k hole, starting at 100M, hole punching. For example, to punch a 64k hole, starting at 100M,
into an existing file which has no ADB structure: into an existing file which has no ADB structure:
INITIALIZE {100M, 64k, 1, NFS4_UINT64_MAX, INITIALIZE {100M, 64k, 1, NFS4_UINT64_MAX,
0, NFS4_UINT64_MAX, 0x0} 0, NFS4_UINT64_MAX, 0x0}
6. Labeled NFS 7. Labeled NFS
6.1. Introduction 7.1. Introduction
Access control models such as Unix permissions or Access Control Access control models such as Unix permissions or Access Control
Lists are commonly referred to as Discretionary Access Control (DAC) Lists are commonly referred to as Discretionary Access Control (DAC)
models. These systems base their access decisions on user identity models. These systems base their access decisions on user identity
and resource ownership. In contrast Mandatory Access Control (MAC) and resource ownership. In contrast Mandatory Access Control (MAC)
models base their access control decisions on the label on the models base their access control decisions on the label on the
subject (usually a process) and the object it wishes to access. subject (usually a process) and the object it wishes to access.
These labels may contain user identity information but usually These labels may contain user identity information but usually
contain additional information. In DAC systems users are free to contain additional information. In DAC systems users are free to
specify the access rules for resources that they own. MAC models specify the access rules for resources that they own. MAC models
skipping to change at page 42, line 19 skipping to change at page 45, line 6
The second change is to provide a method for the server to notify the The second change is to provide a method for the server to notify the
client that the attribute changed on an open file on the server. If client that the attribute changed on an open file on the server. If
the file is closed, then during the open attempt, the client will the file is closed, then during the open attempt, the client will
gather the new attribute value. The server MUST NOT communicate the gather the new attribute value. The server MUST NOT communicate the
new value of the attribute; the client MUST query it. This new value of the attribute; the client MUST query it. This
requirement stems from the need for the client to provide sufficient requirement stems from the need for the client to provide sufficient
access rights to the attribute. access rights to the attribute.
The final change necessary is a modification to the RPC layer used in The final change necessary is a modification to the RPC layer used in
NFSv4 in the form of a new version of the RPCSEC_GSS [6] framework. NFSv4 in the form of a new version of the RPCSEC_GSS [7] framework.
In order for an NFSv4 server to apply MAC checks it must obtain In order for an NFSv4 server to apply MAC checks it must obtain
additional information from the client. Several methods were additional information from the client. Several methods were
explored for performing this and it was decided that the best explored for performing this and it was decided that the best
approach was to incorporate the ability to make security attribute approach was to incorporate the ability to make security attribute
assertions through the RPC mechanism. RPCSECGSSv3 [5] outlines a assertions through the RPC mechanism. RPCSECGSSv3 [5] outlines a
method to assert additional security information such as security method to assert additional security information such as security
labels on gss context creation and have that data bound to all RPC labels on gss context creation and have that data bound to all RPC
requests that make use of that context. requests that make use of that context.
6.2. Definitions 7.2. Definitions
Label Format Specifier (LFS): is an identifier used by the client to Label Format Specifier (LFS): is an identifier used by the client to
establish the syntactic format of the security label and the establish the syntactic format of the security label and the
semantic meaning of its components. These specifiers exist in a semantic meaning of its components. These specifiers exist in a
registry associated with documents describing the format and registry associated with documents describing the format and
semantics of the label. semantics of the label.
Label Format Registry: is the IANA registry containing all Label Format Registry: is the IANA registry containing all
registered LFS along with references to the documents that registered LFS along with references to the documents that
describe the syntactic format and semantics of the security label. describe the syntactic format and semantics of the security label.
skipping to change at page 43, line 15 skipping to change at page 45, line 47
Object: is a passive resource within the system that we wish to be Object: is a passive resource within the system that we wish to be
protected. Objects can be entities such as files, directories, protected. Objects can be entities such as files, directories,
pipes, sockets, and many other system resources relevant to the pipes, sockets, and many other system resources relevant to the
protection of the system state. protection of the system state.
Subject: A subject is an active entity usually a process which is Subject: A subject is an active entity usually a process which is
requesting access to an object. requesting access to an object.
Multi-Level Security (MLS): is a traditional model where objects are Multi-Level Security (MLS): is a traditional model where objects are
given a sensitivity level (Unclassified, Secret, Top Secret, etc) given a sensitivity level (Unclassified, Secret, Top Secret, etc)
and a category set [20]. and a category set [22].
6.3. MAC Security Attribute 7.3. MAC Security Attribute
MAC models base access decisions on security attributes bound to MAC models base access decisions on security attributes bound to
subjects and objects. This information can range from a user subjects and objects. This information can range from a user
identity for an identity based MAC model, sensitivity levels for identity for an identity based MAC model, sensitivity levels for
Multi-level security, or a type for Type Enforcement. These models Multi-level security, or a type for Type Enforcement. These models
base their decisions on different criteria but the semantics of the base their decisions on different criteria but the semantics of the
security attribute remain the same. The semantics required by the security attribute remain the same. The semantics required by the
security attributes are listed below: security attributes are listed below:
o Must provide flexibility with respect to MAC model. o Must provide flexibility with respect to MAC model.
skipping to change at page 43, line 42 skipping to change at page 46, line 30
o Must provide the ability to enforce access control decisions both o Must provide the ability to enforce access control decisions both
on the client and the server on the client and the server
o Must not expose an object to either the client or server name o Must not expose an object to either the client or server name
space before its security information has been bound to it. space before its security information has been bound to it.
NFSv4 implements the security attribute as a recommended attribute. NFSv4 implements the security attribute as a recommended attribute.
These attributes have a fixed format and semantics, which conflicts These attributes have a fixed format and semantics, which conflicts
with the flexible nature of the security attribute. To resolve this with the flexible nature of the security attribute. To resolve this
the security attribute consists of two components. The first the security attribute consists of two components. The first
component is an LFS as defined in [21] to allow for interoperability component is an LFS as defined in [23] to allow for interoperability
between MAC mechanisms. The second component is an opaque field between MAC mechanisms. The second component is an opaque field
which is the actual security attribute data. To allow for various which is the actual security attribute data. To allow for various
MAC models NFSv4 should be used solely as a transport mechanism for MAC models NFSv4 should be used solely as a transport mechanism for
the security attribute. It is the responsibility of the endpoints to the security attribute. It is the responsibility of the endpoints to
consume the security attribute and make access decisions based on consume the security attribute and make access decisions based on
their respective models. In addition, creation of objects through their respective models. In addition, creation of objects through
OPEN and CREATE allows for the security attribute to be specified OPEN and CREATE allows for the security attribute to be specified
upon creation. By providing an atomic create and set operation for upon creation. By providing an atomic create and set operation for
the security attribute it is possible to enforce the second and the security attribute it is possible to enforce the second and
fourth requirements. The recommended attribute FATTR4_SEC_LABEL will fourth requirements. The recommended attribute FATTR4_SEC_LABEL will
be used to satisfy this requirement. be used to satisfy this requirement.
6.3.1. Interpreting FATTR4_SEC_LABEL 7.3.1. Interpreting FATTR4_SEC_LABEL
The XDR [22] necessary to implement Labeled NFSv4 is presented below: The XDR [24] necessary to implement Labeled NFSv4 is presented below:
const FATTR4_SEC_LABEL = 81; const FATTR4_SEC_LABEL = 81;
typedef uint32_t policy4; typedef uint32_t policy4;
Figure 6 Figure 6
struct labelformat_spec4 { struct labelformat_spec4 {
policy4 lfs_lfs; policy4 lfs_lfs;
policy4 lfs_pi; policy4 lfs_pi;
skipping to change at page 44, line 32 skipping to change at page 47, line 21
labelformat_spec4 slai_lfs; labelformat_spec4 slai_lfs;
opaque slai_data<>; opaque slai_data<>;
}; };
The FATTR4_SEC_LABEL contains an array of two components with the The FATTR4_SEC_LABEL contains an array of two components with the
first component being an LFS. It serves to provide the receiving end first component being an LFS. It serves to provide the receiving end
with the information necessary to translate the security attribute with the information necessary to translate the security attribute
into a form that is usable by the endpoint. Label Formats assigned into a form that is usable by the endpoint. Label Formats assigned
an LFS may optionally choose to include a Policy Identifier field to an LFS may optionally choose to include a Policy Identifier field to
allow for complex policy deployments. The LFS and Label Format allow for complex policy deployments. The LFS and Label Format
Registry are described in detail in [21]. The translation used to Registry are described in detail in [23]. The translation used to
interpret the security attribute is not specified as part of the interpret the security attribute is not specified as part of the
protocol as it may depend on various factors. The second component protocol as it may depend on various factors. The second component
is an opaque section which contains the data of the attribute. This is an opaque section which contains the data of the attribute. This
component is dependent on the MAC model to interpret and enforce. component is dependent on the MAC model to interpret and enforce.
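As an illustration of the two-component layout described above, the following is a minimal sketch of how the attribute value could be serialized on the wire. It assumes only the standard XDR rules (big-endian 32-bit integers, variable-length opaque data prefixed by a length and padded to a 4-byte boundary). The function names encode_sec_label and decode_sec_label are hypothetical and are not part of the protocol; the arguments lfs and pi stand for the lfs_lfs and lfs_pi fields.

```python
import struct


def encode_sec_label(lfs: int, pi: int, data: bytes) -> bytes:
    """XDR-encode lfs_lfs, lfs_pi, then the variable-length opaque label data."""
    # XDR pads variable-length opaque data with zero bytes to a 4-byte boundary.
    pad = (4 - len(data) % 4) % 4
    return struct.pack(">III", lfs, pi, len(data)) + data + b"\x00" * pad


def decode_sec_label(buf: bytes):
    """Return (lfs, pi, data); the opaque data is returned uninterpreted."""
    lfs, pi, length = struct.unpack_from(">III", buf, 0)
    data = buf[12:12 + length]
    if len(data) != length:
        raise ValueError("truncated label data")
    return lfs, pi, data
```

The decoder deliberately leaves the opaque bytes untouched, since interpreting them is the job of the MAC model identified by the LFS.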
In particular, it is the responsibility of the LFS specification to In particular, it is the responsibility of the LFS specification to
define a maximum size for the opaque section, slai_data<>. When define a maximum size for the opaque section, slai_data<>. When
creating or modifying a label for an object, the client needs to be creating or modifying a label for an object, the client needs to be
guaranteed that the server will accept a label that is sized guaranteed that the server will accept a label that is sized
correctly. By both client and server being part of a specific MAC correctly. By both client and server being part of a specific MAC
model, the client will be aware of the size. model, the client will be aware of the size.
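The size rule above could be applied by a client before it sends a label. The sketch below is illustrative only: the table MAX_LABEL_SIZE, the LFS number 4242, and the 255-byte limit are invented for the example and do not come from any LFS specification.

```python
# Hypothetical table: LFS number -> maximum size in bytes of the opaque label
# data, as would be defined by each LFS specification. Values are made up.
MAX_LABEL_SIZE = {4242: 255}


def label_size_ok(lfs: int, data: bytes) -> bool:
    """True only if the LFS is known and the label fits its defined maximum."""
    limit = MAX_LABEL_SIZE.get(lfs)
    return limit is not None and len(data) <= limit
```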
6.3.2. Delegations 7.3.2. Delegations
In the event that a security attribute is changed on the server while In the event that a security attribute is changed on the server while
a client holds a delegation on the file, the client should follow the a client holds a delegation on the file, the client should follow the
existing protocol with respect to attribute changes. It should flush existing protocol with respect to attribute changes. It should flush
all changes back to the server and relinquish the delegation. all changes back to the server and relinquish the delegation.
6.3.3. Permission Checking 7.3.3. Permission Checking
It is not feasible to enumerate all possible MAC models and even It is not feasible to enumerate all possible MAC models and even
levels of protection within a subset of these models. This means levels of protection within a subset of these models. This means
that the NFSv4 client and servers cannot be expected to directly make that the NFSv4 client and servers cannot be expected to directly make
access control decisions based on the security attribute. Instead access control decisions based on the security attribute. Instead
NFSv4 should defer permission checking on this attribute to the host NFSv4 should defer permission checking on this attribute to the host
system. These checks are performed in addition to existing DAC and system. These checks are performed in addition to existing DAC and
ACL checks outlined in the NFSv4 protocol. Section 6.6 gives a ACL checks outlined in the NFSv4 protocol. Section 7.6 gives a
specific example of how the security attribute is handled under a specific example of how the security attribute is handled under a
particular MAC model. particular MAC model.
6.3.4. Object Creation 7.3.4. Object Creation
When creating files in NFSv4 the OPEN and CREATE operations are used. When creating files in NFSv4 the OPEN and CREATE operations are used.
One of the parameters to these operations is an fattr4 structure One of the parameters to these operations is an fattr4 structure
containing the attributes the file is to be created with. This containing the attributes the file is to be created with. This
allows NFSv4 to atomically set the security attribute of files upon allows NFSv4 to atomically set the security attribute of files upon
creation. When a client is MAC aware it must always provide the creation. When a client is MAC aware it must always provide the
initial security attribute upon file creation. In the event that the initial security attribute upon file creation. In the event that the
server is the only MAC aware entity in the system it should ignore server is the only MAC aware entity in the system it should ignore
the security attribute specified by the client and instead make the the security attribute specified by the client and instead make the
determination itself. A more in depth explanation can be found in determination itself. A more in depth explanation can be found in
Section 6.6. Section 7.6.
6.3.5. Existing Objects 7.3.5. Existing Objects
Note that under the MAC model, all objects must have labels. Note that under the MAC model, all objects must have labels.
Therefore, if an existing server is upgraded to include LNFS support, Therefore, if an existing server is upgraded to include LNFS support,
then it is the responsibility of the security system to define the then it is the responsibility of the security system to define the
behavior for existing objects. For example, if the security system behavior for existing objects. For example, if the security system
is LFS 0, which means the server just stores and returns labels, then is LFS 0, which means the server just stores and returns labels, then
existing files should return labels which are set to an empty value. existing files should return labels which are set to an empty value.
6.3.6. Label Changes 7.3.6. Label Changes
As per the requirements, when a file's security label is modified, As per the requirements, when a file's security label is modified,
the server must notify all clients which have the file opened of the the server must notify all clients which have the file opened of the
change in label. It does so with CB_ATTR_CHANGED. There are change in label. It does so with CB_ATTR_CHANGED. There are
preconditions to making an attribute change imposed by NFSv4 and the preconditions to making an attribute change imposed by NFSv4 and the
security system might want to impose others. In the process of security system might want to impose others. In the process of
meeting these preconditions, the server may choose to either serve the meeting these preconditions, the server may choose to either serve the
request in whole or return NFS4ERR_DELAY to the SETATTR operation. request in whole or return NFS4ERR_DELAY to the SETATTR operation.
If there are open delegations on the file belonging to a client other If there are open delegations on the file belonging to a client other
than the one making the label change, then the process described in than the one making the label change, then the process described in
Section 6.3.2 must be followed. Section 7.3.2 must be followed.
As the server is always presented with the subject label from the As the server is always presented with the subject label from the
client, it does not necessarily need to communicate the fact that the client, it does not necessarily need to communicate the fact that the
label has changed to the client. In the cases where the change label has changed to the client. In the cases where the change
outright denies the client access, the client will be able to quickly outright denies the client access, the client will be able to quickly
determine that there is a new label in effect. It is in cases where determine that there is a new label in effect. It is in cases where
the client may share the same object between multiple subjects or a the client may share the same object between multiple subjects or a
security system which is not strictly hierarchical that the security system which is not strictly hierarchical that the
CB_ATTR_CHANGED callback is very useful. It allows the server to CB_ATTR_CHANGED callback is very useful. It allows the server to
inform the clients that the cached security attribute is now stale. inform the clients that the cached security attribute is now stale.
skipping to change at page 46, line 26 skipping to change at page 49, line 13
server has a very simple security system which just stores the server has a very simple security system which just stores the
labels. In this system, the MAC label check always allows access, labels. In this system, the MAC label check always allows access,
regardless of the subject label. regardless of the subject label.
The way in which MAC labels are enforced is by the smart client. So The way in which MAC labels are enforced is by the smart client. So
if client A changes a security label on a file, then the server MUST if client A changes a security label on a file, then the server MUST
inform all clients that have the file opened that the label has inform all clients that have the file opened that the label has
changed via CB_ATTR_CHANGED. Then the clients MUST retrieve the new changed via CB_ATTR_CHANGED. Then the clients MUST retrieve the new
label and MUST enforce access via the new attribute values. label and MUST enforce access via the new attribute values.
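The client side of this exchange can be sketched as a small label cache. This is a minimal illustration, not a definitive implementation: the class LabelCache and the fetch_label callable (which stands in for a GETATTR of FATTR4_SEC_LABEL) are hypothetical names. The callback carries no label value, so the client only invalidates its cached copy and queries the server again on the next access.

```python
class LabelCache:
    """Client-side cache of object labels, invalidated by a change callback."""

    def __init__(self, fetch_label):
        # fetch_label(filehandle) -> label; stands in for a GETATTR request.
        self._fetch = fetch_label
        self._labels = {}

    def get(self, fh):
        """Return the cached label, querying the server if none is cached."""
        if fh not in self._labels:
            self._labels[fh] = self._fetch(fh)
        return self._labels[fh]

    def on_attr_changed(self, fh):
        """Handle the callback: drop the stale label, do not trust a pushed value."""
        self._labels.pop(fh, None)
```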
[[Comment.7: Describe a LFS of 0, which will be the means to indicate [[Comment.6: Describe a LFS of 0, which will be the means to indicate
such a deployment. In the current LFR, 0 is marked as reserved. If such a deployment. In the current LFR, 0 is marked as reserved. If
we use it, then we define the default LFS to be used by a LNFS aware we use it, then we define the default LFS to be used by a LNFS aware
server. I.e., it lets smart clients work together in the face of a server. I.e., it lets smart clients work together in the face of a
dumb server. Note that while supporting this system is optional, it dumb server. Note that while supporting this system is optional, it
will make for a very good debugging mode during development. I.e., will make for a very good debugging mode during development. I.e.,
even if a server does not deploy with another security system, this even if a server does not deploy with another security system, this
mode gets your foot in the door. --TH]] mode gets your foot in the door. --TH]]
6.4. pNFS Considerations 7.4. pNFS Considerations
This section examines the issues in deploying LNFS in a pNFS This section examines the issues in deploying LNFS in a pNFS
community of servers. community of servers.
6.4.1. MAC Label Checks 7.4.1. MAC Label Checks
The new FATTR4_SEC_LABEL attribute is metadata information and as The new FATTR4_SEC_LABEL attribute is metadata information and as
such the DS is not aware of the value contained on the MDS. such the DS is not aware of the value contained on the MDS.
Fortunately, the NFSv4.1 protocol [2] already has provisions for Fortunately, the NFSv4.1 protocol [2] already has provisions for
doing access level checks from the DS to the MDS. In order for the doing access level checks from the DS to the MDS. In order for the
DS to validate the subject label presented by the client, it SHOULD DS to validate the subject label presented by the client, it SHOULD
utilize this mechanism. utilize this mechanism.
If a file's FATTR4_SEC_LABEL is changed, then the MDS should utilize If a file's FATTR4_SEC_LABEL is changed, then the MDS should utilize
CB_ATTR_CHANGED to inform the client of that fact. If the MDS is CB_ATTR_CHANGED to inform the client of that fact. If the MDS is
maintaining maintaining
6.5. Discovery of Server LNFS Support 7.5. Discovery of Server LNFS Support
The server can easily determine that a client supports LNFS when it The server can easily determine that a client supports LNFS when it
queries for the FATTR4_SEC_LABEL label for an object. Note that it queries for the FATTR4_SEC_LABEL label for an object. Note that it
cannot assume that the presence of RPCSEC_GSSv3 indicates LNFS cannot assume that the presence of RPCSEC_GSSv3 indicates LNFS
support. The client might need to discover which LFS the server support. The client might need to discover which LFS the server
supports. supports.
A server which supports LNFS MUST allow a client with any subject A server which supports LNFS MUST allow a client with any subject
label to retrieve the FATTR4_SEC_LABEL attribute for the root label to retrieve the FATTR4_SEC_LABEL attribute for the root
filehandle, ROOTFH. The following compound must always succeed as filehandle, ROOTFH. The following compound must always succeed as
skipping to change at page 47, line 28 skipping to change at page 50, line 15
PUTROOTFH, GETATTR {FATTR4_SEC_LABEL} PUTROOTFH, GETATTR {FATTR4_SEC_LABEL}
Note that the server might have imposed a security flavor on the root Note that the server might have imposed a security flavor on the root
that precludes such access. I.e., if the server requires kerberized that precludes such access. I.e., if the server requires kerberized
access and the client presents a compound with AUTH_SYS, then the access and the client presents a compound with AUTH_SYS, then the
server is allowed to return NFS4ERR_WRONGSEC in this case. But if server is allowed to return NFS4ERR_WRONGSEC in this case. But if
the client presents a correct security flavor, then the server MUST the client presents a correct security flavor, then the server MUST
return the FATTR4_SEC_LABEL attribute with the supported LFS filled return the FATTR4_SEC_LABEL attribute with the supported LFS filled
in. in.
6.6. MAC Security NFS Modes of Operation 7.6. MAC Security NFS Modes of Operation
A system using Labeled NFS may operate in three modes. The first A system using Labeled NFS may operate in three modes. The first
mode provides the most protection and is called "full mode". In this mode provides the most protection and is called "full mode". In this
mode both the client and server implement a MAC model allowing each mode both the client and server implement a MAC model allowing each
end to make an access control decision. The remaining two modes are end to make an access control decision. The remaining two modes are
variations on each other and are called "smart client" and "smart variations on each other and are called "smart client" and "smart
server" modes. In these modes one end of the connection is not server" modes. In these modes one end of the connection is not
implementing a MAC model and because of this these operating modes implementing a MAC model and because of this these operating modes
offer less protection than full mode. offer less protection than full mode.
6.6.1. Full Mode 7.6.1. Full Mode
Full mode environments consist of MAC aware NFSv4 servers and clients Full mode environments consist of MAC aware NFSv4 servers and clients
and may be composed of mixed MAC models and policies. The system and may be composed of mixed MAC models and policies. The system
requires that both the client and server have an opportunity to requires that both the client and server have an opportunity to
perform an access control check based on all relevant information perform an access control check based on all relevant information
within the network. The file object security attribute is provided within the network. The file object security attribute is provided
using the mechanism described in Section 6.3. The security attribute using the mechanism described in Section 7.3. The security attribute
of the subject making the request is transported at the RPC layer of the subject making the request is transported at the RPC layer
using the mechanism described in RPCSECGSSv3 [5]. using the mechanism described in RPCSECGSSv3 [5].
6.6.1.1. Initial Labeling and Translation 7.6.1.1. Initial Labeling and Translation
The ability to create a file is an action that a MAC model may wish The ability to create a file is an action that a MAC model may wish
to mediate. The client is given the responsibility to determine the to mediate. The client is given the responsibility to determine the
initial security attribute to be placed on a file. This allows the initial security attribute to be placed on a file. This allows the
client to make a decision as to the acceptable security attributes to client to make a decision as to the acceptable security attributes to
create a file with before sending the request to the server. Once create a file with before sending the request to the server. Once
the server receives the creation request from the client it may the server receives the creation request from the client it may
choose to evaluate if the security attribute is acceptable. choose to evaluate if the security attribute is acceptable.
Security attributes on the client and server may vary based on MAC Security attributes on the client and server may vary based on MAC
skipping to change at page 48, line 28 skipping to change at page 51, line 11
identify the format and meaning of the opaque portion of the security identify the format and meaning of the opaque portion of the security
attribute. A full mode environment may contain hosts operating in attribute. A full mode environment may contain hosts operating in
several different LFSs and DOIs. In this case a mechanism for several different LFSs and DOIs. In this case a mechanism for
translating the opaque portion of the security attribute is needed. translating the opaque portion of the security attribute is needed.
The actual translation function will vary based on MAC model and The actual translation function will vary based on MAC model and
policy and is out of the scope of this document. If a translation is policy and is out of the scope of this document. If a translation is
unavailable for a given LFS and DOI then the request SHOULD be unavailable for a given LFS and DOI then the request SHOULD be
denied. Another recourse is to allow the host to provide a fallback denied. Another recourse is to allow the host to provide a fallback
mapping for unknown security attributes. mapping for unknown security attributes.
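The two outcomes described above (deny when no translation exists, or fall back to a host-provided mapping) can be sketched as follows. The function translate_label, the translations table keyed by (source, destination) pairs of (LFS, DOI), and the fallback argument are all hypothetical; the actual translation functions are out of scope for this document.

```python
def translate_label(translations, src, dst, data, fallback=None):
    """Translate opaque label data from the src (LFS, DOI) to the dst (LFS, DOI).

    translations maps (src, dst) to a callable; fallback is an optional
    host-provided label used for unknown security attributes.
    """
    if src == dst:
        return data  # same LFS and DOI: no translation needed
    fn = translations.get((src, dst))
    if fn is not None:
        return fn(data)
    if fallback is not None:
        return fallback
    # No translation and no fallback: the request is denied.
    raise PermissionError("no translation available for this LFS and DOI")
```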
6.6.1.2. Policy Enforcement 7.6.1.2. Policy Enforcement
In full mode access control decisions are made by both the clients In full mode access control decisions are made by both the clients
and servers. When a client makes a request it takes the security and servers. When a client makes a request it takes the security
attribute from the requesting process and makes an access control attribute from the requesting process and makes an access control
decision based on that attribute and the security attribute of the decision based on that attribute and the security attribute of the
object it is trying to access. If the client denies that access an object it is trying to access. If the client denies that access an
RPC call to the server is never made. If however the access is RPC call to the server is never made. If however the access is
allowed the client will make a call to the NFS server. allowed the client will make a call to the NFS server.
When the server receives the request from the client it extracts the When the server receives the request from the client it extracts the
skipping to change at page 49, line 5 skipping to change at page 51, line 34
trying to access to make an access control decision. If the server's trying to access to make an access control decision. If the server's
policy allows this access it will fulfill the client's request, policy allows this access it will fulfill the client's request,
otherwise it will return NFS4ERR_ACCESS. otherwise it will return NFS4ERR_ACCESS.
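The two-stage check in full mode can be sketched as below. This is an illustration under stated assumptions: client_request, server_handle, the policy callables, and send_rpc are hypothetical names, and the policies are reduced to simple predicates over a subject label and an object label. The numeric values of NFS4_OK and NFS4ERR_ACCESS are those used by NFSv4.

```python
NFS4_OK = 0
NFS4ERR_ACCESS = 13


def server_handle(subject, obj_label, server_policy):
    """Server-side check using the subject label asserted by the client."""
    return NFS4_OK if server_policy(subject, obj_label) else NFS4ERR_ACCESS


def client_request(subject, obj_label, client_policy, send_rpc):
    """Client-side check; the RPC is issued only if the local policy allows it."""
    if not client_policy(subject, obj_label):
        return None  # denied locally, no RPC call is made
    return send_rpc(subject)
```

Returning None for a local denial is a simplification; a real client would report the denial in the form native to its platform.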
Implementations MAY validate security attributes supplied over the Implementations MAY validate security attributes supplied over the
network to ensure that they are within a set of attributes permitted network to ensure that they are within a set of attributes permitted
from a specific peer, and if not, reject them. Note that a system from a specific peer, and if not, reject them. Note that a system
may permit a different set of attributes to be accepted from each may permit a different set of attributes to be accepted from each
peer. peer.
6.6.2. Smart Client Mode 7.6.2. Smart Client Mode
Smart client environments consist of NFSv4 servers that are not MAC Smart client environments consist of NFSv4 servers that are not MAC
aware but NFSv4 clients that are. Clients in this environment aware but NFSv4 clients that are. Clients in this environment
may consist of groups implementing different MAC models and policies. may consist of groups implementing different MAC models and policies.
The system requires that all clients in the environment be The system requires that all clients in the environment be
responsible for access control checks. Due to the amount of trust responsible for access control checks. Due to the amount of trust
placed in the clients this mode is only to be used in a trusted placed in the clients this mode is only to be used in a trusted
environment. environment.
6.6.2.1. Initial Labeling and Translation 7.6.2.1. Initial Labeling and Translation
Just like in full mode the client is responsible for determining the Just like in full mode the client is responsible for determining the
initial label upon object creation. The server in smart client mode initial label upon object creation. The server in smart client mode
does not implement a MAC model, however, it may provide the ability does not implement a MAC model, however, it may provide the ability
to restrict the creation and labeling of objects with certain labels to restrict the creation and labeling of objects with certain labels
based on different criteria as described in Section 6.6.1.2. based on different criteria as described in Section 7.6.1.2.
In a smart client environment a group of clients operate in a single In a smart client environment a group of clients operate in a single
DOI. This removes the need for the clients to maintain a set of DOI DOI. This removes the need for the clients to maintain a set of DOI
translations. Servers should provide a method to allow different translations. Servers should provide a method to allow different
groups of clients to access the server at the same time. However it groups of clients to access the server at the same time. However it
should not allow two groups of clients operating in different DOIs to should not allow two groups of clients operating in different DOIs to
access the same files. access the same files.
6.6.2.2. Policy Enforcement 7.6.2.2. Policy Enforcement
In smart client mode access control decisions are made by the In smart client mode access control decisions are made by the
clients. When a client accesses an object it obtains the security clients. When a client accesses an object it obtains the security
attribute of the object from the server and combines it with the attribute of the object from the server and combines it with the
security attribute of the process making the request to make an security attribute of the process making the request to make an
access control decision. This check is in addition to the DAC checks access control decision. This check is in addition to the DAC checks
provided by NFSv4 so this may fail based on the DAC criteria even if provided by NFSv4 so this may fail based on the DAC criteria even if
the MAC policy grants access. As the policy check is located on the the MAC policy grants access. As the policy check is located on the
client an access control denial should take the form that is native client an access control denial should take the form that is native
to the platform. to the platform.
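The combination of DAC and MAC checks on a smart client can be
sketched as follows. This is a minimal illustration only: the label
names, the MAC_POLICY table, and the function names are hypothetical
and are not defined by the protocol. PermissionError stands in for
"the form that is native to the platform".

```python
# Hypothetical MAC policy: (subject label, object label) -> permitted operations.
MAC_POLICY = {
    ("staff_t", "report_t"): {"read", "write"},
    ("guest_t", "report_t"): {"read"},
}


def mac_allows(subject_label, object_label, op):
    """Return True if the (assumed) MAC policy grants op."""
    return op in MAC_POLICY.get((subject_label, object_label), set())


def smart_client_check(subject_label, object_label, op, dac_allows):
    """Grant access only if both the DAC check and the MAC check pass.

    object_label is the security attribute fetched from the server;
    subject_label belongs to the local process making the request.
    """
    if not dac_allows:
        # DAC can deny even when the MAC policy would grant access.
        raise PermissionError("denied by DAC")
    if not mac_allows(subject_label, object_label, op):
        raise PermissionError("denied by MAC policy")
    return True
```

Either check alone is sufficient to deny the request; neither alone
is sufficient to grant it.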
6.6.3. Smart Server Mode 7.6.3. Smart Server Mode
Smart server environments consist of NFSv4 servers that are MAC aware Smart server environments consist of NFSv4 servers that are MAC aware
and one or more MAC unaware clients. The server is the only entity and one or more MAC unaware clients. The server is the only entity
enforcing policy, and may selectively provide standard NFS services enforcing policy, and may selectively provide standard NFS services
to clients based on their authentication credentials and/or to clients based on their authentication credentials and/or
associated network attributes (e.g., IP address, network interface). associated network attributes (e.g., IP address, network interface).
The level of trust and access extended to a client in this mode is The level of trust and access extended to a client in this mode is
configuration-specific. configuration-specific.
6.6.3.1. Initial Labeling and Translation 7.6.3.1. Initial Labeling and Translation
In smart server mode all labeling and access control decisions are In smart server mode all labeling and access control decisions are
performed by the NFSv4 server. In this environment the NFSv4 clients performed by the NFSv4 server. In this environment the NFSv4 clients
are not MAC aware so they cannot provide input into the access are not MAC aware so they cannot provide input into the access
control decision. This requires the server to determine the initial control decision. This requires the server to determine the initial
labeling of objects. Normally the subject to use in this calculation labeling of objects. Normally the subject to use in this calculation
would originate from the client. Instead the NFSv4 server may choose would originate from the client. Instead the NFSv4 server may choose
to assign the subject security attribute based on their to assign the subject security attribute based on their
authentication credentials and/or associated network attributes authentication credentials and/or associated network attributes
(e.g., IP address, network interface). (e.g., IP address, network interface).
In smart server mode security attributes are contained solely within In smart server mode security attributes are contained solely within
the NFSv4 server. This means that all security attributes used in the NFSv4 server. This means that all security attributes used in
the system remain within a single LFS and DOI. Since security the system remain within a single LFS and DOI. Since security
attributes will not cross DOIs or change format there is no need to attributes will not cross DOIs or change format there is no need to
provide any translation functionality above that which is needed provide any translation functionality above that which is needed
internally by the MAC model. internally by the MAC model.
6.6.3.2. Policy Enforcement 7.6.3.2. Policy Enforcement
All access control decisions in smart server mode are made by the All access control decisions in smart server mode are made by the
server. The server will assign the subject a security attribute server. The server will assign the subject a security attribute
based on some criteria (e.g., IP address, network interface). Using based on some criteria (e.g., IP address, network interface). Using
the newly calculated security attribute and the security attribute of the newly calculated security attribute and the security attribute of
the object being requested the MAC model makes the access control the object being requested the MAC model makes the access control
check and returns NFS4ERR_ACCESS on a denial and NFS4_OK on success. check and returns NFS4ERR_ACCESS on a denial and NFS4_OK on success.
This check is done transparently to the client so if the MAC This check is done transparently to the client so if the MAC
permission check fails the client may be unaware of the reason for permission check fails the client may be unaware of the reason for
the permission failure. When operating in this mode administrators the permission failure. When operating in this mode administrators
attempting to debug permission failures should remember to check the attempting to debug permission failures should remember to check the
MAC policy running on the server in addition to the DAC settings. MAC policy running on the server in addition to the DAC settings.
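A minimal sketch of the smart server decision path follows. The
network-to-label table, the POLICY table, and the function names are
assumptions made for illustration; only the two status codes come from
the protocol (NFS4ERR_ACCESS has the value 13 in NFSv4.1).

```python
import ipaddress

NFS4_OK = 0
NFS4ERR_ACCESS = 13

# Hypothetical mapping from client network to subject security attribute.
# The first matching entry wins.
SUBJECT_BY_NETWORK = [
    (ipaddress.ip_network("10.0.0.0/8"), "internal_t"),
    (ipaddress.ip_network("0.0.0.0/0"), "untrusted_t"),
]

# Hypothetical MAC policy: (subject label, object label) -> permitted operations.
POLICY = {
    ("internal_t", "payroll_t"): {"read"},
    ("internal_t", "public_t"): {"read", "write"},
    ("untrusted_t", "public_t"): {"read"},
}


def subject_for(client_ip):
    """Assign a subject label from the client's IP address."""
    addr = ipaddress.ip_address(client_ip)
    for network, label in SUBJECT_BY_NETWORK:
        if addr in network:
            return label
    return "untrusted_t"  # e.g. an IPv6 client not covered by the table


def server_mac_check(client_ip, object_label, op):
    """Return NFS4_OK on success and NFS4ERR_ACCESS on a MAC denial."""
    subject = subject_for(client_ip)
    allowed = POLICY.get((subject, object_label), set())
    return NFS4_OK if op in allowed else NFS4ERR_ACCESS
```

The client sees only the status code, which is why a MAC denial is
indistinguishable from a DAC denial on the client side.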
6.7. Security Considerations 7.7. Security Considerations
This entire document deals with security issues. This entire document deals with security issues.
Depending on the level of protection the MAC system offers there may Depending on the level of protection the MAC system offers there may
be a requirement to tightly bind the security attribute to the data. be a requirement to tightly bind the security attribute to the data.
When only one of the client or server enforces labels, it is When only one of the client or server enforces labels, it is
important to realize that the other side is not enforcing MAC important to realize that the other side is not enforcing MAC
protections. Alternate methods might be in use to handle the lack of protections. Alternate methods might be in use to handle the lack of
MAC support and care should be taken to identify and mitigate threats MAC support and care should be taken to identify and mitigate threats
from possible tampering outside of these methods. from possible tampering outside of these methods.
An example of this is that a server that modifies READDIR or LOOKUP An example of this is that a server that modifies READDIR or LOOKUP
results based on the client's subject label might want to always results based on the client's subject label might want to always
construct the same subject label for a client which does not present construct the same subject label for a client which does not present
one. This will prevent a non-LNFS client from mixing entries in the one. This will prevent a non-LNFS client from mixing entries in the
directory cache. directory cache.
7. Sharing change attribute implementation details with NFSv4 clients 8. Sharing change attribute implementation details with NFSv4 clients
7.1. Introduction 8.1. Introduction
Although both the NFSv4 [10] and NFSv4.1 [2] protocols define the Although both the NFSv4 [11] and NFSv4.1 [2] protocols define the
change attribute as being mandatory to implement, there is little in change attribute as being mandatory to implement, there is little in
the way of guidance. The only feature that is mandated by them is the way of guidance. The only feature that is mandated by them is
that the value must change whenever the file data or metadata change. that the value must change whenever the file data or metadata change.
While this allows for a wide range of implementations, it also leaves While this allows for a wide range of implementations, it also leaves
the client with a conundrum: how does it determine which is the most the client with a conundrum: how does it determine which is the most
recent value for the change attribute in a case where several RPC recent value for the change attribute in a case where several RPC
calls have been issued in parallel? In other words if two COMPOUNDs, calls have been issued in parallel? In other words if two COMPOUNDs,
both containing WRITE and GETATTR requests for the same file, have both containing WRITE and GETATTR requests for the same file, have
been issued in parallel, how does the client determine which of the been issued in parallel, how does the client determine which of the
skipping to change at page 51, line 34 skipping to change at page 54, line 16
requests corresponds to the most recent state of the file? In some requests corresponds to the most recent state of the file? In some
cases, the only recourse may be to send another COMPOUND containing a cases, the only recourse may be to send another COMPOUND containing a
third GETATTR that is fully serialised with the first two. third GETATTR that is fully serialised with the first two.
NFSv4.2 avoids this kind of inefficiency by allowing the server to NFSv4.2 avoids this kind of inefficiency by allowing the server to
share details about how the change attribute is expected to evolve, share details about how the change attribute is expected to evolve,
so that the client may immediately determine which, out of the so that the client may immediately determine which, out of the
several change attribute values returned by the server, is the most several change attribute values returned by the server, is the most
recent. recent.
7.2. Definition of the 'change_attr_type' per-file system attribute 8.2. Definition of the 'change_attr_type' per-file system attribute
enum change_attr_typeinfo { enum change_attr_typeinfo {
NFS4_CHANGE_TYPE_IS_MONOTONIC_INCR = 0, NFS4_CHANGE_TYPE_IS_MONOTONIC_INCR = 0,
NFS4_CHANGE_TYPE_IS_VERSION_COUNTER = 1, NFS4_CHANGE_TYPE_IS_VERSION_COUNTER = 1,
NFS4_CHANGE_TYPE_IS_VERSION_COUNTER_NOPNFS = 2, NFS4_CHANGE_TYPE_IS_VERSION_COUNTER_NOPNFS = 2,
NFS4_CHANGE_TYPE_IS_TIME_METADATA = 3, NFS4_CHANGE_TYPE_IS_TIME_METADATA = 3,
NFS4_CHANGE_TYPE_IS_UNDEFINED = 4 NFS4_CHANGE_TYPE_IS_UNDEFINED = 4
}; };
+------------------+----+---------------------------+-----+ +------------------+----+---------------------------+-----+
skipping to change at page 52, line 26 skipping to change at page 55, line 7
preserved when writing to pNFS data servers. preserved when writing to pNFS data servers.
NFS4_CHANGE_TYPE_IS_VERSION_COUNTER_NOPNFS: The change attribute NFS4_CHANGE_TYPE_IS_VERSION_COUNTER_NOPNFS: The change attribute
value MUST be incremented by one unit for every atomic change to value MUST be incremented by one unit for every atomic change to
the file attributes, data or directory contents. In the case the file attributes, data or directory contents. In the case
where the client is writing to pNFS data servers, the number of where the client is writing to pNFS data servers, the number of
increments is not guaranteed to exactly match the number of increments is not guaranteed to exactly match the number of
writes. writes.
NFS4_CHANGE_TYPE_IS_TIME_METADATA: The change attribute is NFS4_CHANGE_TYPE_IS_TIME_METADATA: The change attribute is
implemented as suggested in the NFSv4 spec [10] in terms of the implemented as suggested in the NFSv4 spec [11] in terms of the
time_metadata attribute. time_metadata attribute.
NFS4_CHANGE_TYPE_IS_UNDEFINED: The change attribute does not take NFS4_CHANGE_TYPE_IS_UNDEFINED: The change attribute does not take
values that fit into any of these categories. values that fit into any of these categories.
If any of NFS4_CHANGE_TYPE_IS_MONOTONIC_INCR, If any of NFS4_CHANGE_TYPE_IS_MONOTONIC_INCR,
NFS4_CHANGE_TYPE_IS_VERSION_COUNTER, or NFS4_CHANGE_TYPE_IS_VERSION_COUNTER, or
NFS4_CHANGE_TYPE_IS_TIME_METADATA is set, then the client knows at NFS4_CHANGE_TYPE_IS_TIME_METADATA is set, then the client knows at
the very least that the change attribute is monotonically increasing, the very least that the change attribute is monotonically increasing,
which is sufficient to resolve the question of which value is the which is sufficient to resolve the question of which value is the
skipping to change at page 53, line 5 skipping to change at page 55, line 32
has the option of detecting rogue server implementations that use has the option of detecting rogue server implementations that use
time_metadata in violation of the spec. time_metadata in violation of the spec.
Finally, if the client sees NFS4_CHANGE_TYPE_IS_VERSION_COUNTER, it Finally, if the client sees NFS4_CHANGE_TYPE_IS_VERSION_COUNTER, it
has the ability to predict what the resulting change attribute value has the ability to predict what the resulting change attribute value
should be after a COMPOUND containing a SETATTR, WRITE, or CREATE. should be after a COMPOUND containing a SETATTR, WRITE, or CREATE.
This again allows it to detect changes made in parallel by another This again allows it to detect changes made in parallel by another
client. The value NFS4_CHANGE_TYPE_IS_VERSION_COUNTER_NOPNFS permits client. The value NFS4_CHANGE_TYPE_IS_VERSION_COUNTER_NOPNFS permits
the same, but only if the client is not doing pNFS WRITEs. the same, but only if the client is not doing pNFS WRITEs.
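How a client might use change_attr_type can be sketched as follows.
The enum values are those of change_attr_typeinfo above; the function
names and the using_pnfs_writes parameter are hypothetical. The set of
monotonic types contains only the three values the text names
explicitly, and 64-bit wraparound of the counter is ignored.

```python
from enum import IntEnum


class ChangeAttrType(IntEnum):
    MONOTONIC_INCR = 0
    VERSION_COUNTER = 1
    VERSION_COUNTER_NOPNFS = 2
    TIME_METADATA = 3
    UNDEFINED = 4


# Types for which the text states the value is monotonically increasing.
MONOTONIC_TYPES = {
    ChangeAttrType.MONOTONIC_INCR,
    ChangeAttrType.VERSION_COUNTER,
    ChangeAttrType.TIME_METADATA,
}


def most_recent(change_type, values):
    """Pick the most recent change value from parallel replies.

    Returns None when the type gives no ordering guarantee, in which
    case the client falls back to a serialised GETATTR.
    """
    if change_type in MONOTONIC_TYPES and values:
        return max(values)
    return None


def predicted_change(change_type, before, atomic_changes, using_pnfs_writes=False):
    """Predict the change value after a known number of atomic changes."""
    if change_type == ChangeAttrType.VERSION_COUNTER:
        return before + atomic_changes
    if (change_type == ChangeAttrType.VERSION_COUNTER_NOPNFS
            and not using_pnfs_writes):
        return before + atomic_changes
    return None
```

If the value returned by the server differs from the prediction, the
client can infer that another client changed the file in parallel.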
8. Security Considerations 9. Security Considerations
9. Operations: REQUIRED, RECOMMENDED, or OPTIONAL 10. Operations: REQUIRED, RECOMMENDED, or OPTIONAL
The following tables summarize the operations of the NFSv4.2 protocol The following tables summarize the operations of the NFSv4.2 protocol
and the corresponding designation of REQUIRED, RECOMMENDED, and and the corresponding designation of REQUIRED, RECOMMENDED, and
OPTIONAL to implement or MUST NOT implement. The designation of MUST OPTIONAL to implement or MUST NOT implement. The designation of MUST
NOT implement is reserved for those operations that were defined in NOT implement is reserved for those operations that were defined in
either NFSv4.0 or NFSv4.1 and MUST NOT be implemented in NFSv4.2. either NFSv4.0 or NFSv4.1 and MUST NOT be implemented in NFSv4.2.
For the most part, the REQUIRED, RECOMMENDED, or OPTIONAL designation For the most part, the REQUIRED, RECOMMENDED, or OPTIONAL designation
for operations sent by the client is for the server implementation. for operations sent by the client is for the server implementation.
The client is generally required to implement the operations needed The client is generally required to implement the operations needed
skipping to change at page 56, line 29 skipping to change at page 59, line 8
| CB_RECALL_ANY | OPT | FDELG, DDELG, pNFS | | CB_RECALL_ANY | OPT | FDELG, DDELG, pNFS |
| | | (REQ) | | | | (REQ) |
| CB_RECALL_SLOT | REQ | | | CB_RECALL_SLOT | REQ | |
| CB_RECALLABLE_OBJ_AVAIL | OPT | DDELG, pNFS (REQ) | | CB_RECALLABLE_OBJ_AVAIL | OPT | DDELG, pNFS (REQ) |
| CB_SEQUENCE | OPT | FDELG, DDELG, pNFS | | CB_SEQUENCE | OPT | FDELG, DDELG, pNFS |
| | | (REQ) | | | | (REQ) |
| CB_WANTS_CANCELLED | OPT | FDELG, DDELG, pNFS | | CB_WANTS_CANCELLED | OPT | FDELG, DDELG, pNFS |
| | | (REQ) | | | | (REQ) |
+-------------------------+-------------------+---------------------+ +-------------------------+-------------------+---------------------+
10. NFSv4.2 Operations 11. NFSv4.2 Operations
10.1. Operation 59: COPY - Initiate a server-side copy 11.1. Operation 59: COPY - Initiate a server-side copy
10.1.1. ARGUMENT 11.1.1. ARGUMENT
const COPY4_GUARDED = 0x00000001; const COPY4_GUARDED = 0x00000001;
const COPY4_METADATA = 0x00000002; const COPY4_METADATA = 0x00000002;
struct COPY4args { struct COPY4args {
/* SAVED_FH: source file */ /* SAVED_FH: source file */
/* CURRENT_FH: destination file or */ /* CURRENT_FH: destination file or */
/* directory */ /* directory */
offset4 ca_src_offset; offset4 ca_src_offset;
offset4 ca_dst_offset; offset4 ca_dst_offset;
length4 ca_count; length4 ca_count;
uint32_t ca_flags; uint32_t ca_flags;
component4 ca_destination; component4 ca_destination;
netloc4 ca_source_server<>; netloc4 ca_source_server<>;
}; };
10.1.2. RESULT 11.1.2. RESULT
union COPY4res switch (nfsstat4 cr_status) { union COPY4res switch (nfsstat4 cr_status) {
case NFS4_OK: case NFS4_OK:
stateid4 cr_callback_id<1>; stateid4 cr_callback_id<1>;
default: default:
length4 cr_bytes_copied; length4 cr_bytes_copied;
}; };
10.1.3. DESCRIPTION 11.1.3. DESCRIPTION
The COPY operation is used for both intra-server and inter-server The COPY operation is used for both intra-server and inter-server
copies. In both cases, the COPY is always sent from the client to copies. In both cases, the COPY is always sent from the client to
the destination server of the file copy. The COPY operation requests the destination server of the file copy. The COPY operation requests
that a file be copied from the location specified by the SAVED_FH that a file be copied from the location specified by the SAVED_FH
value to the location specified by the combination of CURRENT_FH and value to the location specified by the combination of CURRENT_FH and
ca_destination. ca_destination.
The SAVED_FH must be a regular file. If SAVED_FH is not a regular The SAVED_FH must be a regular file. If SAVED_FH is not a regular
file, the operation MUST fail and return NFS4ERR_WRONG_TYPE. file, the operation MUST fail and return NFS4ERR_WRONG_TYPE.
skipping to change at page 61, line 11 skipping to change at page 63, line 27
| homogeneous | 26 | no | | homogeneous | 26 | no |
| layout_alignment | 66 | no | | layout_alignment | 66 | no |
| layout_blksize | 65 | no | | layout_blksize | 65 | no |
| layout_hint | 63 | no | | layout_hint | 63 | no |
| layout_type | 64 | no | | layout_type | 64 | no |
| maxfilesize | 27 | no | | maxfilesize | 27 | no |
| maxlink | 28 | no | | maxlink | 28 | no |
| maxname | 29 | no | | maxname | 29 | no |
| maxread | 30 | no | | maxread | 30 | no |
| maxwrite | 31 | no | | maxwrite | 31 | no |
| max_hole_punch | 31 | no |
| mdsthreshold | 68 | no | | mdsthreshold | 68 | no |
| mimetype | 32 | MUST | | mimetype | 32 | MUST |
| mode | 33 | MUST | | mode | 33 | MUST |
| mode_set_masked | 74 | no | | mode_set_masked | 74 | no |
| mounted_on_fileid | 55 | no | | mounted_on_fileid | 55 | no |
| no_trunc | 34 | no | | no_trunc | 34 | no |
| numlinks | 35 | no | | numlinks | 35 | no |
| owner | 36 | MUST | | owner | 36 | MUST |
| owner_group | 37 | MUST | | owner_group | 37 | MUST |
| quota_avail_hard | 38 | no | | quota_avail_hard | 38 | no |
skipping to change at page 64, line 23 skipping to change at page 66, line 36
NFS4ERR_DELAY: The server does not have the resources to perform the NFS4ERR_DELAY: The server does not have the resources to perform the
copy operation at the current time. The client should retry the copy operation at the current time. The client should retry the
operation sometime in the future. operation sometime in the future.
NFS4ERR_METADATA_NOTSUPP: The destination file cannot support the NFS4ERR_METADATA_NOTSUPP: The destination file cannot support the
same metadata as the source file. same metadata as the source file.
NFS4ERR_WRONGSEC: The security mechanism being used by the client NFS4ERR_WRONGSEC: The security mechanism being used by the client
does not match the server's security policy. does not match the server's security policy.
10.2. Operation 60: COPY_ABORT - Cancel a server-side copy 11.2. Operation 60: COPY_ABORT - Cancel a server-side copy
10.2.1. ARGUMENT 11.2.1. ARGUMENT
struct COPY_ABORT4args { struct COPY_ABORT4args {
/* CURRENT_FH: destination file */ /* CURRENT_FH: destination file */
stateid4 caa_stateid; stateid4 caa_stateid;
}; };
10.2.2. RESULT 11.2.2. RESULT
struct COPY_ABORT4res { struct COPY_ABORT4res {
nfsstat4 car_status; nfsstat4 car_status;
}; };
10.2.3. DESCRIPTION 11.2.3. DESCRIPTION
COPY_ABORT is used for both intra- and inter-server asynchronous COPY_ABORT is used for both intra- and inter-server asynchronous
copies. The COPY_ABORT operation allows the client to cancel a copies. The COPY_ABORT operation allows the client to cancel a
server-side copy operation that it initiated. This operation is sent server-side copy operation that it initiated. This operation is sent
in a COMPOUND request from the client to the destination server. in a COMPOUND request from the client to the destination server.
This operation may be used to cancel a copy when the application that This operation may be used to cancel a copy when the application that
requested the copy exits before the operation is completed or for requested the copy exits before the operation is completed or for
some other reason. some other reason.
The request contains the filehandle and copy stateid cookies that act The request contains the filehandle and copy stateid cookies that act
skipping to change at page 65, line 27 skipping to change at page 67, line 42
NFS4ERR_RETRY: The abort failed, but a retry at some time in the NFS4ERR_RETRY: The abort failed, but a retry at some time in the
future MAY succeed. future MAY succeed.
NFS4ERR_COMPLETE_ALREADY: The abort failed, and a callback will NFS4ERR_COMPLETE_ALREADY: The abort failed, and a callback will
deliver the results of the copy operation. deliver the results of the copy operation.
NFS4ERR_SERVERFAULT: An error occurred on the server that does not NFS4ERR_SERVERFAULT: An error occurred on the server that does not
map to a specific error code. map to a specific error code.
10.3. Operation 61: COPY_NOTIFY - Notify a source server of a future 11.3. Operation 61: COPY_NOTIFY - Notify a source server of a future
copy copy
10.3.1. ARGUMENT 11.3.1. ARGUMENT
struct COPY_NOTIFY4args { struct COPY_NOTIFY4args {
/* CURRENT_FH: source file */ /* CURRENT_FH: source file */
netloc4 cna_destination_server; netloc4 cna_destination_server;
}; };
10.3.2. RESULT 11.3.2. RESULT
struct COPY_NOTIFY4resok { struct COPY_NOTIFY4resok {
nfstime4 cnr_lease_time; nfstime4 cnr_lease_time;
netloc4 cnr_source_server<>; netloc4 cnr_source_server<>;
}; };
union COPY_NOTIFY4res switch (nfsstat4 cnr_status) { union COPY_NOTIFY4res switch (nfsstat4 cnr_status) {
case NFS4_OK: case NFS4_OK:
COPY_NOTIFY4resok resok4; COPY_NOTIFY4resok resok4;
default: default:
void; void;
}; };
10.3.3. DESCRIPTION 11.3.3. DESCRIPTION
This operation is used for an inter-server copy. A client sends this This operation is used for an inter-server copy. A client sends this
operation in a COMPOUND request to the source server to authorize a operation in a COMPOUND request to the source server to authorize a
destination server identified by cna_destination_server to read the destination server identified by cna_destination_server to read the
file specified by CURRENT_FH on behalf of the given user. file specified by CURRENT_FH on behalf of the given user.
The cna_destination_server MUST be specified using the netloc4 The cna_destination_server MUST be specified using the netloc4
network location format. The server is not required to resolve the network location format. The server is not required to resolve the
cna_destination_server address before completing this operation. cna_destination_server address before completing this operation.
skipping to change at page 68, line 11 skipping to change at page 70, line 11
present on the source server. The client can determine the present on the source server. The client can determine the
correct location and reissue the operation with the correct correct location and reissue the operation with the correct
location. location.
NFS4ERR_NOTSUPP: The copy offload operation is not supported by the NFS4ERR_NOTSUPP: The copy offload operation is not supported by the
NFS server receiving this request. NFS server receiving this request.
NFS4ERR_WRONGSEC: The security mechanism being used by the client NFS4ERR_WRONGSEC: The security mechanism being used by the client
does not match the server's security policy. does not match the server's security policy.
10.4. Operation 62: COPY_REVOKE - Revoke a destination server's copy 11.4. Operation 62: COPY_REVOKE - Revoke a destination server's copy
privileges privileges
10.4.1. ARGUMENT 11.4.1. ARGUMENT
struct COPY_REVOKE4args { struct COPY_REVOKE4args {
/* CURRENT_FH: source file */ /* CURRENT_FH: source file */
netloc4 cra_destination_server; netloc4 cra_destination_server;
}; };
10.4.2. RESULT 11.4.2. RESULT
struct COPY_REVOKE4res { struct COPY_REVOKE4res {
nfsstat4 crr_status; nfsstat4 crr_status;
}; };
10.4.3. DESCRIPTION 11.4.3. DESCRIPTION
This operation is used for an inter-server copy. A client sends this This operation is used for an inter-server copy. A client sends this
operation in a COMPOUND request to the source server to revoke the operation in a COMPOUND request to the source server to revoke the
authorization of a destination server identified by authorization of a destination server identified by
cra_destination_server from reading the file specified by CURRENT_FH cra_destination_server from reading the file specified by CURRENT_FH
on behalf of the given user. If the cra_destination_server has already on behalf of the given user. If the cra_destination_server has already
begun copying the file, a successful return from this operation begun copying the file, a successful return from this operation
indicates that further access will be prevented. indicates that further access will be prevented.
The cra_destination_server MUST be specified using the netloc4 The cra_destination_server MUST be specified using the netloc4
skipping to change at page 69, line 16 skipping to change at page 71, line 16
a partial list): a partial list):
NFS4ERR_MOVED: The file system which contains the source file is not NFS4ERR_MOVED: The file system which contains the source file is not
present on the source server. The client can determine the present on the source server. The client can determine the
correct location and reissue the operation with the correct correct location and reissue the operation with the correct
location. location.
NFS4ERR_NOTSUPP: The copy offload operation is not supported by the NFS4ERR_NOTSUPP: The copy offload operation is not supported by the
NFS server receiving this request. NFS server receiving this request.
10.5. Operation 63: COPY_STATUS - Poll for status of a server-side copy 11.5. Operation 63: COPY_STATUS - Poll for status of a server-side copy
10.5.1. ARGUMENT 11.5.1. ARGUMENT
struct COPY_STATUS4args { struct COPY_STATUS4args {
/* CURRENT_FH: destination file */ /* CURRENT_FH: destination file */
stateid4 csa_stateid; stateid4 csa_stateid;
}; };
10.5.2. RESULT 11.5.2. RESULT
struct COPY_STATUS4resok { struct COPY_STATUS4resok {
length4 csr_bytes_copied; length4 csr_bytes_copied;
nfsstat4 csr_complete<1>; nfsstat4 csr_complete<1>;
}; };
union COPY_STATUS4res switch (nfsstat4 csr_status) { union COPY_STATUS4res switch (nfsstat4 csr_status) {
case NFS4_OK: case NFS4_OK:
COPY_STATUS4resok resok4; COPY_STATUS4resok resok4;
default: default:
void; void;
}; };
10.5.3. DESCRIPTION 11.5.3. DESCRIPTION
COPY_STATUS is used for both intra- and inter-server asynchronous COPY_STATUS is used for both intra- and inter-server asynchronous
copies. The COPY_STATUS operation allows the client to poll the copies. The COPY_STATUS operation allows the client to poll the
server to determine the status of an asynchronous copy operation. server to determine the status of an asynchronous copy operation.
This operation is sent by the client to the destination server. This operation is sent by the client to the destination server.
If this operation is successful, the number of bytes copied is If this operation is successful, the number of bytes copied is
returned to the client in the csr_bytes_copied field. The returned to the client in the csr_bytes_copied field. The
csr_bytes_copied value indicates the number of bytes copied but not csr_bytes_copied value indicates the number of bytes copied but not
which specific bytes have been copied. which specific bytes have been copied.
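A client-side polling loop over COPY_STATUS might look like the sketch
below. It assumes that the presence of the optional csr_complete field
signals that the copy has finished and carries its final status; that
interpretation, the send_copy_status callable, and the max_polls bound
are assumptions made for illustration. A real client would also wait
between polls.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class CopyStatusResok:
    """Mirror of COPY_STATUS4resok; csr_complete<1> holds zero or one status."""
    csr_bytes_copied: int
    csr_complete: List[int] = field(default_factory=list)


def poll_copy(
    send_copy_status: Callable[[], CopyStatusResok],
    max_polls: int,
) -> Tuple[int, Optional[int]]:
    """Poll until csr_complete is present or max_polls is reached.

    Returns (bytes copied in the last reply, final status or None).
    """
    bytes_copied = 0
    for _ in range(max_polls):
        res = send_copy_status()
        bytes_copied = res.csr_bytes_copied
        if res.csr_complete:
            return bytes_copied, res.csr_complete[0]
    return bytes_copied, None
```

The byte count is useful for progress reporting only; as noted above,
it does not identify which byte ranges have been written.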
skipping to change at page 70, line 29 skipping to change at page 72, line 29
NFS4ERR_NOTSUPP: The copy status operation is not supported by the NFS4ERR_NOTSUPP: The copy status operation is not supported by the
NFS server receiving this request. NFS server receiving this request.
NFS4ERR_BAD_STATEID: The stateid is not valid (see Section 2.3.2 NFS4ERR_BAD_STATEID: The stateid is not valid (see Section 2.3.2
below). below).
NFS4ERR_EXPIRED: The stateid has expired (see Copy Offload Stateid NFS4ERR_EXPIRED: The stateid has expired (see Copy Offload Stateid
section below). section below).
10.6. Modification to Operation 42: EXCHANGE_ID - Instantiate Client ID 11.6. Modification to Operation 42: EXCHANGE_ID - Instantiate Client ID
10.6.1. ARGUMENT 11.6.1. ARGUMENT
/* new */ /* new */
const EXCHGID4_FLAG_SUPP_FENCE_OPS = 0x00000004; const EXCHGID4_FLAG_SUPP_FENCE_OPS = 0x00000004;
10.6.2. RESULT 11.6.2. RESULT
Unchanged Unchanged
10.6.3. MOTIVATION 11.6.3. MOTIVATION
Enterprise applications require guarantees that an operation has Enterprise applications require guarantees that an operation has
either aborted or completed. NFSv4.1 provides this guarantee as long either aborted or completed. NFSv4.1 provides this guarantee as long
as the session is alive: simply send a SEQUENCE operation on the same as the session is alive: simply send a SEQUENCE operation on the same
slot with a new sequence number, and the successful return of slot with a new sequence number, and the successful return of
SEQUENCE indicates the previous operation has completed. However, if SEQUENCE indicates the previous operation has completed. However, if
the session is lost, there is no way to know when any in-progress the session is lost, there is no way to know when any in-progress
operations have aborted or completed. In hindsight, the NFSv4.1 operations have aborted or completed. In hindsight, the NFSv4.1
specification should have mandated that DESTROY_SESSION abort/ specification should have mandated that DESTROY_SESSION abort/
complete all outstanding operations. complete all outstanding operations.
10.6.4. DESCRIPTION 11.6.4. DESCRIPTION
A client SHOULD request the EXCHGID4_FLAG_SUPP_FENCE_OPS capability A client SHOULD request the EXCHGID4_FLAG_SUPP_FENCE_OPS capability
when it sends an EXCHANGE_ID operation. The server SHOULD set this when it sends an EXCHANGE_ID operation. The server SHOULD set this
capability in the EXCHANGE_ID reply whether the client requests it or capability in the EXCHANGE_ID reply whether the client requests it or
not. If the client ID is created with this capability then the not. If the client ID is created with this capability then the
following will occur: following will occur:
o The server will not reply to DESTROY_SESSION until all operations o The server will not reply to DESTROY_SESSION until all operations
in progress are completed or aborted. in progress are completed or aborted.
skipping to change at page 71, line 34 skipping to change at page 73, line 34
sessions, opens, locks, delegations, layouts, and/or wants are sessions, opens, locks, delegations, layouts, and/or wants are
deleted. deleted.
o The NFS server SHOULD support client ID trunking, and if it does o The NFS server SHOULD support client ID trunking, and if it does
and the EXCHGID4_FLAG_SUPP_FENCE_OPS capability is enabled, then a and the EXCHGID4_FLAG_SUPP_FENCE_OPS capability is enabled, then a
session ID created on one node of the storage cluster MUST be session ID created on one node of the storage cluster MUST be
destroyable via DESTROY_SESSION. In addition, DESTROY_CLIENTID destroyable via DESTROY_SESSION. In addition, DESTROY_CLIENTID
and an EXCHANGE_ID with a new verifier affect all sessions and an EXCHANGE_ID with a new verifier affect all sessions
regardless of which node the sessions were created on. regardless of which node the sessions were created on.
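Requesting and detecting the capability reduces to a bit test on the
EXCHANGE_ID flags word, sketched below. The constant is the one
defined in the ARGUMENT section; the two helper names are
hypothetical.

```python
EXCHGID4_FLAG_SUPP_FENCE_OPS = 0x00000004


def request_fence_ops(flags: int) -> int:
    """Return the EXCHANGE_ID argument flags with the capability requested."""
    return flags | EXCHGID4_FLAG_SUPP_FENCE_OPS


def fence_ops_enabled(reply_flags: int) -> bool:
    """Return True if the server set the capability in its reply."""
    return bool(reply_flags & EXCHGID4_FLAG_SUPP_FENCE_OPS)
```

Because the server may set the flag whether or not the client asked
for it, the client should test the reply flags rather than rely on
what it sent.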
10.7. Operation 64: INITIALIZE 11.7. Operation 64: INITIALIZE
This operation can be used to initialize the structure imposed by an This operation can be used to initialize the structure imposed by an
application onto a file and to punch a hole into a file. application onto a file and to punch a hole into a file.
The server has no concept of the structure imposed by the The server has no concept of the structure imposed by the
application. Order is imposed only when the application writes to a application. Order is imposed only when the application writes to a
section of the file. In order to detect corruption even section of the file. In order to detect corruption even
before the application utilizes the file, the application will want before the application utilizes the file, the application will want
to initialize a range of ADBs. It uses the INITIALIZE operation to to initialize a range of ADBs. It uses the INITIALIZE operation to
do so. do so.
10.7.1. ARGUMENT 11.7.1. ARGUMENT
/* /*
* We use data_content4 in case we wish to * We use data_content4 in case we wish to
* extend new types later. Note that we * extend new types later. Note that we
* are explicitly disallowing data. * are explicitly disallowing data.
*/ */
union initialize_arg4 switch (data_content4 content) { union initialize_arg4 switch (data_content4 content) {
case NFS4_CONTENT_APP_BLOCK: case NFS4_CONTENT_APP_BLOCK:
app_data_block4 ia_adb; app_data_block4 ia_adb;
case NFS4_CONTENT_HOLE: case NFS4_CONTENT_HOLE:
hole_info4 ia_hole; data_info4 ia_hole;
default: default:
void; void;
}; };
struct INITIALIZE4args { struct INITIALIZE4args {
/* CURRENT_FH: file */ /* CURRENT_FH: file */
stateid4 ia_stateid; stateid4 ia_stateid;
stable_how4 ia_stable; stable_how4 ia_stable;
initialize_arg4 ia_data<>; initialize_arg4 ia_data<>;
}; };
10.7.2. RESULT 11.7.2. RESULT
struct INITIALIZE4resok { struct INITIALIZE4resok {
count4 ir_count; count4 ir_count;
stable_how4 ir_committed; stable_how4 ir_committed;
verifier4 ir_writeverf; verifier4 ir_writeverf;
data_content4 ir_sparse; data_content4 ir_sparse;
}; };
union INITIALIZE4res switch (nfsstat4 status) { union INITIALIZE4res switch (nfsstat4 status) {
case NFS4_OK: case NFS4_OK:
INITIALIZE4resok resok4; INITIALIZE4resok resok4;
default: default:
void; void;
}; };
10.7.3. DESCRIPTION 11.7.3. DESCRIPTION
When the client invokes the INITIALIZE operation, it has two desired When the client invokes the INITIALIZE operation, it has two desired
results: results:
1. The structure described by the app_data_block4 be imposed on the 1. The structure described by the app_data_block4 be imposed on the
file. file.
2. The contents described by the app_data_block4 be sparse. 2. The contents described by the app_data_block4 be sparse.
If the server supports the INITIALIZE operation, it still might not If the server supports the INITIALIZE operation, it still might not
support sparse files. So if it receives the INITIALIZE operation, support sparse files. So if it receives the INITIALIZE operation,
then it MUST populate the contents of the file with the initialized then it MUST populate the contents of the file with the initialized
ADBs. In other words, if the server supports INITIALIZE, then it ADBs. In other words, if the server supports INITIALIZE, then it
supports the concept of ADBs. [[Comment.8: Do we want to support an supports the concept of ADBs. [[Comment.7: Do we want to support an
asynchronous INITIALIZE? Do we have to? --TH]] asynchronous INITIALIZE? Do we have to? --TH]]
If the data was already initialized, there are two interesting If the data was already initialized, there are two interesting
scenarios: scenarios:
1. The data blocks are allocated. 1. The data blocks are allocated.
2. Initializing in the middle of an existing ADB. 2. Initializing in the middle of an existing ADB.
If the data blocks were already allocated, then the INITIALIZE is a If the data blocks were already allocated, then the INITIALIZE is a
hole punch operation. If INITIALIZE supports sparse files, then the hole punch operation. If INITIALIZE supports sparse files, then the
data blocks are to be deallocated. If not, then the data blocks are data blocks are to be deallocated. If not, then the data blocks are
to be rewritten in the indicated ADB format. [[Comment.9: Need to to be rewritten in the indicated ADB format. [[Comment.8: Need to
document interaction between space reservation and hole punching? document interaction between space reservation and hole punching?
--TH]] --TH]]
Since the server has no knowledge of ADBs, it should not report Since the server has no knowledge of ADBs, it should not report
misaligned creation of ADBs. Even while it can detect them, it misaligned creation of ADBs. Even while it can detect them, it
cannot disallow them, as the application might be in the process of cannot disallow them, as the application might be in the process of
changing the size of the ADBs. Thus the server must be prepared to changing the size of the ADBs. Thus the server must be prepared to
handle an INITIALIZE into an existing ADB. handle an INITIALIZE into an existing ADB.
This document does not mandate the manner in which the server stores This document does not mandate the manner in which the server stores
ADBs sparsely for a file. It does assume that if ADBs are stored ADBs sparsely for a file. It does assume that if ADBs are stored
sparsely, then the server can detect when an INITIALIZE arrives that sparsely, then the server can detect when an INITIALIZE arrives that
will force a new ADB to start inside an existing ADB. For example, will force a new ADB to start inside an existing ADB. For example,
assume that ADBi has an adb_block_size of 4k and that an INITIALIZE assume that ADBi has an adb_block_size of 4k and that an INITIALIZE
starts 1k inside ADBi. The server should [[Comment.10: Need to flesh starts 1k inside ADBi. The server should [[Comment.9: Need to flesh
this out. --TH]] this out. --TH]]
10.7.3.1. Hole punching 11.7.3.1. Hole punching
Whenever a client wishes to deallocate the blocks backing a Whenever a client wishes to deallocate the blocks backing a
particular region in the file, it calls the INITIALIZE operation with particular region in the file, it calls the INITIALIZE operation with
the current filehandle set to the filehandle of the file in question, the current filehandle set to the filehandle of the file in question,
start offset and length in bytes of the region set in hpa_offset and start offset and length in bytes of the region set in hpa_offset and
hpa_count respectively. All further reads to this region MUST return hpa_count respectively. All further reads to this region MUST return
zeros until overwritten. The filehandle specified must be that of a zeros until overwritten. The filehandle specified must be that of a
regular file. regular file.
Situations may arise where ia_hole.hi_offset and/or ia_hole.hi_offset Situations may arise where ia_hole.hi_offset and/or ia_hole.hi_offset
skipping to change at page 74, line 47 skipping to change at page 76, line 47
NFS4ERR_NOTSUPP The Hole punch operations are not supported by the NFS4ERR_NOTSUPP The Hole punch operations are not supported by the
NFS server receiving this request. NFS server receiving this request.
NFS4ERR_DIR The current filehandle is of type NF4DIR. NFS4ERR_DIR The current filehandle is of type NF4DIR.
NFS4ERR_SYMLINK The current filehandle is of type NF4LNK. NFS4ERR_SYMLINK The current filehandle is of type NF4LNK.
NFS4ERR_WRONG_TYPE The current filehandle does not designate an NFS4ERR_WRONG_TYPE The current filehandle does not designate an
ordinary file. ordinary file.
10.8. Changes to Operation 51: LAYOUTRETURN 11.8. Operation 67: IO_ADVISE - Application I/O access pattern hints
10.8.1. Introduction
This section introduces a new operation, named IO_ADVISE, which
allows NFS clients to communicate application I/O access pattern
hints to the NFS server. This new operation will allow hints to be
sent to the server when applications use posix_fadvise, direct I/O,
or at any other point that the client finds useful.
11.8.1. ARGUMENT
enum IO_ADVISE_type4 {
IO_ADVISE4_NORMAL = 0,
IO_ADVISE4_SEQUENTIAL = 1,
IO_ADVISE4_SEQUENTIAL_BACKWARDS = 2,
IO_ADVISE4_RANDOM = 3,
IO_ADVISE4_WILLNEED = 4,
IO_ADVISE4_WILLNEED_OPPORTUNISTIC = 5,
IO_ADVISE4_DONTNEED = 6,
IO_ADVISE4_NOREUSE = 7,
IO_ADVISE4_READ = 8,
IO_ADVISE4_WRITE = 9
};
struct IO_ADVISE4args {
/* CURRENT_FH: file */
stateid4 iar_stateid;
offset4 iar_offset;
length4 iar_count;
bitmap4 iar_hints;
};
11.8.2. RESULT
struct IO_ADVISE4resok {
bitmap4 ior_hints;
};
union IO_ADVISE4res switch (nfsstat4 status) {
case NFS4_OK:
IO_ADVISE4resok resok4;
default:
void;
};
11.8.3. DESCRIPTION
The IO_ADVISE operation sends an I/O access pattern hint to the
server for the owner of the stateid for a given byte range specified by
iar_offset and iar_count. The byte range specified by iar_offset and
iar_count need not currently exist in the file, but the iar_hints
will apply to the byte range when it does exist. If iar_count is 0,
all data following iar_offset is specified. The server MAY ignore
the advice.
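The argument encoding described above can be sketched as follows. This is only an illustration, assuming that each IO_ADVISE_type4 value names a bit position within iar_hints and that bitmap4 is carried as an array of 32-bit words; the helper names hints_to_bitmap, bitmap_to_hints, and advised_range are hypothetical and do not appear in this document.

```python
from enum import IntEnum


class IoAdvise(IntEnum):
    """Hint numbers copied from the IO_ADVISE_type4 enum."""
    NORMAL = 0
    SEQUENTIAL = 1
    SEQUENTIAL_BACKWARDS = 2
    RANDOM = 3
    WILLNEED = 4
    WILLNEED_OPPORTUNISTIC = 5
    DONTNEED = 6
    NOREUSE = 7
    READ = 8
    WRITE = 9


def hints_to_bitmap(hints):
    """Pack hint numbers into a list of 32-bit words (assumed bitmap4 layout)."""
    words = []
    for hint in hints:
        word, bit = divmod(int(hint), 32)
        while len(words) <= word:
            words.append(0)
        words[word] |= 1 << bit
    return words


def bitmap_to_hints(words):
    """Unpack a list of 32-bit words back into sorted hint numbers."""
    hints = []
    for index, word in enumerate(words):
        for bit in range(32):
            if word & (1 << bit):
                hints.append(index * 32 + bit)
    return hints


def advised_range(iar_offset, iar_count):
    """Return (start, end) with end exclusive; end is None when iar_count
    is 0, meaning all data following iar_offset."""
    if iar_count == 0:
        return (iar_offset, None)
    return (iar_offset, iar_offset + iar_count)
```

For example, hints_to_bitmap([IoAdvise.SEQUENTIAL, IoAdvise.READ]) yields a single word with bits 1 and 8 set.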
The following are the possible hints:
IO_ADVISE4_NORMAL Specifies that the application has no advice to
give on its behavior with respect to the specified data. It is
the default characteristic if no advice is given.
IO_ADVISE4_SEQUENTIAL Specifies that the stateid holder expects to
access the specified data sequentially from lower offsets to
higher offsets.
IO_ADVISE4_SEQUENTIAL_BACKWARDS Specifies that the stateid holder
expects to access the specified data sequentially from higher
offsets to lower offsets.
IO_ADVISE4_RANDOM Specifies that the stateid holder expects to access
the specified data in a random order.
IO_ADVISE4_WILLNEED Specifies that the stateid holder expects to
access the specified data in the near future.
IO_ADVISE4_WILLNEED_OPPORTUNISTIC Specifies that the stateid holder
expects to possibly access the data in the near future. This is a
speculative hint, and therefore the server should prefetch data or
indirect blocks only if it can be done at a marginal cost.
IO_ADVISE4_DONTNEED Specifies that the stateid holder expects that it
will not access the specified data in the near future.
IO_ADVISE4_NOREUSE Specifies that the stateid holder expects to access
the specified data once and then not reuse it thereafter.
IO_ADVISE4_READ Specifies that the stateid holder expects to read the
specified data in the near future.
IO_ADVISE4_WRITE Specifies that the stateid holder expects to write
the specified data in the near future.
The server will return success if the operation is properly formed,
otherwise the server will return an error. The server MUST NOT
return an error if it does not recognize or does not support the
requested advice. This is also true even if the client sends
contradictory hints to the server, e.g., IO_ADVISE4_SEQUENTIAL and
IO_ADVISE4_RANDOM in a single IO_ADVISE operation. In this case, the
server MUST return success and an ior_hints value that indicates the
hint it intends to optimize. For contradictory hints, this may mean
simply returning IO_ADVISE4_NORMAL for example.
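One way a server might build its ior_hints reply under these rules is sketched below. It is an assumption-laden illustration, not a mandated algorithm: the function name choose_reply_hints is hypothetical, and the choice to treat NORMAL, SEQUENTIAL, SEQUENTIAL_BACKWARDS, and RANDOM as mutually contradictory is one possible policy. Unrecognized or unsupported hints are dropped silently, never rejected.

```python
# Hint numbers copied from the IO_ADVISE_type4 enum.
NORMAL, SEQUENTIAL, SEQUENTIAL_BACKWARDS, RANDOM = 0, 1, 2, 3

# Assumed policy: at most one access-pattern hint can be honored at a time.
PATTERN_HINTS = {NORMAL, SEQUENTIAL, SEQUENTIAL_BACKWARDS, RANDOM}


def choose_reply_hints(requested, supported):
    """Return the sorted hint numbers a server could report in ior_hints.

    Hints the server does not recognize or support are ignored rather
    than treated as errors. If several access-pattern hints remain,
    they are replaced by NORMAL, as the text above suggests.
    """
    accepted = set(requested) & set(supported)
    if len(accepted & PATTERN_HINTS) > 1:
        accepted = (accepted - PATTERN_HINTS) | {NORMAL}
    return sorted(accepted)
```

With every hint from 0 to 9 supported, a request for SEQUENTIAL, RANDOM, and READ (1, 3, 8) would be answered with NORMAL and READ (0, 8).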
The ior_hints returned by the server is primarily for debugging
purposes since the server is under no obligation to carry out the
hints that it describes in the ior_hints result. In addition, while
the server may have intended to implement the hints returned in
ior_hints, as time progresses, the server may need to change its
handling of a given file due to several reasons including, but not
limited to, memory pressure, additional IO_ADVISE hints sent by other
clients, and heuristically detected file access patterns.
The server MAY return different advice than what the client
requested. If it does, then this might be due to one of several
conditions, including, but not limited to: another client advising of
a different I/O access pattern; a different I/O access pattern from
another client that the server has heuristically detected; or
the server not being able to support the requested I/O access pattern,
perhaps due to a temporary resource limitation.
Each issuance of the IO_ADVISE operation overrides all previous
issuances of IO_ADVISE for a given byte range. This effectively
follows a strategy of last hint wins for a given stateid and byte
range.
Clients should assume that hints included in an IO_ADVISE operation
will be forgotten once the file is closed.
11.8.4. IMPLEMENTATION
The NFS client may choose to issue an IO_ADVISE operation to the
server in several different instances.
The most obvious is in direct response to an application's execution
of posix_fadvise. In this case, IO_ADVISE4_WRITE and IO_ADVISE4_READ
may be set based upon the type of file access specified when the file
was opened.
Another useful point would be when an application indicates it is
using direct I/O. Direct I/O may be specified at file open, in which
case an IO_ADVISE may be included in the same compound as the OPEN
operation with the IO_ADVISE4_NOREUSE flag set. Direct I/O may also
be specified separately, in which case an IO_ADVISE operation can be
sent to the server separately. As above, IO_ADVISE4_WRITE and
IO_ADVISE4_READ may be set based upon the type of file access
specified when the file was opened.
11.8.5. pNFS File Layout Data Type Considerations
The IO_ADVISE considerations for pNFS are very similar to the COMMIT
considerations for pNFS. That is, as with COMMIT, some NFS server
implementations prefer IO_ADVISE be done on the DS, and some prefer
it be done on the MDS.
So for the file's layout type, it is proposed that NFSv4.2 include an
additional hint NFL42_UFLG_IO_ADVISE_THRU_MDS which is valid only on
NFSv4.2 or higher. Any file's layout obtained with NFSv4.1 MUST NOT
have NFL42_UFLG_IO_ADVISE_THRU_MDS set. Any file's layout obtained
with NFSv4.2 MAY have NFL42_UFLG_IO_ADVISE_THRU_MDS set. If the
client does not implement IO_ADVISE, then it MUST ignore
NFL42_UFLG_IO_ADVISE_THRU_MDS.
If NFL42_UFLG_IO_ADVISE_THRU_MDS is set and the client implements
IO_ADVISE, then if it wants the DS to honor IO_ADVISE, the
client MUST send the operation to the MDS, and the server will
communicate the advice back to each DS. If the client sends IO_ADVISE
to the DS, then the server MAY return NFS4ERR_NOTSUPP.
If NFL42_UFLG_IO_ADVISE_THRU_MDS is not set, then this indicates to
the client that if it wants to inform the server via IO_ADVISE of the
client's intended use of the file, then the client SHOULD send an
IO_ADVISE to each DS. While the client MAY always send IO_ADVISE to
the MDS, if the server has not set NFL42_UFLG_IO_ADVISE_THRU_MDS, the
client should expect that such an IO_ADVISE is futile. Note that a
client SHOULD use the same set of arguments on each IO_ADVISE sent to
a DS for the same open file reference.
The server is not required to support different advice for different
DS's with the same open file reference.
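The client-side routing decision described above can be summarized in a short sketch. The function io_advise_targets and the string "MDS" used to label the metadata server are hypothetical, and the sketch assumes the client has already decoded the layout flag into a boolean.

```python
def io_advise_targets(thru_mds_flag_set, data_servers, args):
    """Return (target, args) pairs for one IO_ADVISE hint.

    thru_mds_flag_set: True when NFL42_UFLG_IO_ADVISE_THRU_MDS is set
    in the layout. In that case the operation goes to the MDS only.
    Otherwise the same arguments are sent to every DS, since the client
    SHOULD use the same set of arguments for the same open file
    reference.
    """
    if thru_mds_flag_set:
        return [("MDS", args)]
    return [(ds, args) for ds in data_servers]
```

Sending identical arguments to each DS keeps the client within what the server is required to support, since different advice per DS is not guaranteed to be honored.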
11.8.5.1. Dense and Sparse Packing Considerations
The IO_ADVISE operation MUST use the iar_offset and byte range as
dictated by the presence or absence of NFL4_UFLG_DENSE.
E.g., if NFL4_UFLG_DENSE is present, and a READ or WRITE to the DS
for iar_offset 0 really means iar_offset 10000 in the logical file,
then an IO_ADVISE for iar_offset 0 means iar_offset 10000.
E.g., if NFL4_UFLG_DENSE is absent, then a READ or WRITE to the DS
for iar_offset 0 really means iar_offset 0 in the logical file, then
an IO_ADVISE for iar_offset 0 means iar_offset 0 in the logical file.
E.g., if NFL4_UFLG_DENSE is present, the stripe unit is 1000 bytes
and the stripe count is 10, and the dense DS file is serving
iar_offset 0. A READ or WRITE to the DS for iar_offsets 0, 1000,
2000, and 3000, really mean iar_offsets 10000, 20000, 30000, and
40000 (implying a stripe count of 10 and a stripe unit of 1000), then
an IO_ADVISE sent to the same DS with an iar_offset of 500, and an
iar_count of 3000 means that the IO_ADVISE applies to these byte
ranges of the dense DS file:
- 500 to 999
- 1000 to 1999
- 2000 to 2999
- 3000 to 3499
I.e., the contiguous range 500 to 3499 as specified in IO_ADVISE.
It also applies to these byte ranges of the logical file:
- 10500 to 10999 (500 bytes)
- 20000 to 20999 (1000 bytes)
- 30000 to 30999 (1000 bytes)
- 40000 to 40499 (500 bytes)
(total 3000 bytes)
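The arithmetic of the dense example can be reproduced with the sketch below. The helper names are hypothetical. The logical starting offset of each dense stripe unit is passed in as a list taken directly from the example (10000, 20000, 30000, 40000) rather than derived from the stripe count, so the sketch makes no claim about how a real layout computes those offsets.

```python
def split_by_stripe_unit(offset, count, stripe_unit):
    """Split [offset, offset + count) at stripe-unit boundaries.

    Returns inclusive (start, end) pairs in DS file offsets.
    """
    pieces = []
    pos = offset
    end = offset + count
    while pos < end:
        unit_end = (pos // stripe_unit + 1) * stripe_unit
        piece_end = min(unit_end, end)
        pieces.append((pos, piece_end - 1))
        pos = piece_end
    return pieces


def dense_to_logical(pieces, stripe_unit, logical_unit_starts):
    """Map dense DS pieces to logical file ranges.

    logical_unit_starts[k] is the logical offset at which the k-th
    stripe unit of the dense DS file begins (supplied by the caller).
    """
    logical = []
    for start, end in pieces:
        unit = start // stripe_unit
        within = start - unit * stripe_unit
        base = logical_unit_starts[unit] + within
        logical.append((base, base + (end - start)))
    return logical


pieces = split_by_stripe_unit(500, 3000, 1000)
ranges = dense_to_logical(pieces, 1000, [10000, 20000, 30000, 40000])
```

Here pieces holds the four dense DS ranges listed above and ranges holds the four logical ranges, covering 3000 bytes in total.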
E.g., if NFL4_UFLG_DENSE is absent, the stripe unit is 250 bytes, the
stripe count is 4, and the sparse DS file is serving iar_offset 0.
Then a READ or WRITE to the DS for iar_offsets 0, 1000, 2000, and
3000, really mean iar_offsets 0, 1000, 2000, and 3000 in the logical
file, keeping in mind that on the DS file, byte ranges 250 to 999,
1250 to 1999, 2250 to 2999, and 3250 to 3999 are not accessible.
Then an IO_ADVISE sent to the same DS with an iar_offset of 500, and
an iar_count of 3000 means that the IO_ADVISE applies to these byte
ranges of the logical file and the sparse DS file:
- 500 to 999 (500 bytes) - no effect
- 1000 to 1249 (250 bytes) - effective
- 1250 to 1999 (750 bytes) - no effect
- 2000 to 2249 (250 bytes) - effective
- 2250 to 2999 (750 bytes) - no effect
- 3000 to 3249 (250 bytes) - effective
- 3250 to 3499 (250 bytes) - no effect
(subtotal 2250 bytes) - no effect
(subtotal 750 bytes) - effective
(grand total 3000 bytes) - no effect + effective
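The sparse table above can be recomputed with a small sketch. It assumes that the DS in the example is the first of the four stripe members (index 0), so that it serves exactly those stripe units whose number is a multiple of the stripe count; the function name sparse_effect is hypothetical.

```python
def sparse_effect(offset, count, stripe_unit, stripe_count, ds_index):
    """Classify [offset, offset + count) on a sparse DS file.

    Returns (start, end, effective) tuples with inclusive ends, where
    effective is True for bytes in stripe units served by this DS.
    Adjacent pieces with the same classification are merged.
    """
    segments = []
    pos = offset
    end = offset + count
    while pos < end:
        unit = pos // stripe_unit
        piece_end = min((unit + 1) * stripe_unit, end)
        effective = unit % stripe_count == ds_index
        if segments and segments[-1][2] == effective:
            # Extend the previous segment instead of starting a new one.
            segments[-1] = (segments[-1][0], piece_end - 1, effective)
        else:
            segments.append((pos, piece_end - 1, effective))
        pos = piece_end
    return segments


segments = sparse_effect(500, 3000, 250, 4, 0)
effective_bytes = sum(e - s + 1 for s, e, ok in segments if ok)
no_effect_bytes = sum(e - s + 1 for s, e, ok in segments if not ok)
```

For the example values this gives the seven rows of the table, with 750 effective bytes and 2250 bytes of no effect.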
If neither of the flags NFL42_UFLG_IO_ADVISE_THRU_MDS and
NFL4_UFLG_DENSE is set in the layout, then any IO_ADVISE request
sent to the data server with a byte range that overlaps a stripe unit
that the data server does not serve MUST NOT result in the status
NFS4ERR_PNFS_IO_HOLE. Instead, the response SHOULD be successful and
if the server applies IO_ADVISE hints on any stripe units that
overlap with the specified range, those hints SHOULD be indicated in
the response.
11.8.6. Number of Supported File Segments
In theory IO_ADVISE allows a client and server to support multiple
file segments, meaning that different, possibly overlapping, byte
ranges of the same open file reference will support different hints.
This is not practical, and in general the server will support just
one set of hints, and these will apply to the entire file. However,
there are some hints that are very ephemeral and essentially amount
to one-time instructions to the NFS server, which will be forgotten
momentarily after IO_ADVISE is executed.
The following hints will always apply to the entire file, regardless
of the specified byte range:
o IO_ADVISE4_NORMAL
o IO_ADVISE4_SEQUENTIAL
o IO_ADVISE4_SEQUENTIAL_BACKWARDS
o IO_ADVISE4_RANDOM
The following hints will always apply to the specified byte range, and
will be treated as one-time instructions:
o IO_ADVISE4_WILLNEED
o IO_ADVISE4_WILLNEED_OPPORTUNISTIC
o IO_ADVISE4_DONTNEED
o IO_ADVISE4_NOREUSE
The following hints are modifiers to all other hints, and will apply
to the entire file and/or to a one time instruction on the specified
byte range:
o IO_ADVISE4_READ
o IO_ADVISE4_WRITE
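The three groups above suggest a simple way for a server to sort incoming hints by scope. The sketch below is illustrative only; the function split_hints and the dictionary keys are hypothetical names, and unknown hints are dropped on the assumption that they must not cause an error.

```python
WHOLE_FILE_HINTS = {
    "IO_ADVISE4_NORMAL",
    "IO_ADVISE4_SEQUENTIAL",
    "IO_ADVISE4_SEQUENTIAL_BACKWARDS",
    "IO_ADVISE4_RANDOM",
}
ONE_TIME_HINTS = {
    "IO_ADVISE4_WILLNEED",
    "IO_ADVISE4_WILLNEED_OPPORTUNISTIC",
    "IO_ADVISE4_DONTNEED",
    "IO_ADVISE4_NOREUSE",
}
MODIFIER_HINTS = {"IO_ADVISE4_READ", "IO_ADVISE4_WRITE"}


def split_hints(hints):
    """Group hint names by the scope they apply to.

    whole_file: applies to the entire file regardless of byte range.
    one_time:   applies to the specified byte range, then is forgotten.
    modifiers:  qualifies the hints in either of the other groups.
    """
    result = {"whole_file": [], "one_time": [], "modifiers": []}
    for hint in hints:
        if hint in WHOLE_FILE_HINTS:
            result["whole_file"].append(hint)
        elif hint in ONE_TIME_HINTS:
            result["one_time"].append(hint)
        elif hint in MODIFIER_HINTS:
            result["modifiers"].append(hint)
        # Anything else is ignored rather than rejected.
    return result
```

A request carrying IO_ADVISE4_SEQUENTIAL, IO_ADVISE4_WILLNEED, and IO_ADVISE4_READ would therefore produce one entry in each group.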
11.8.7. Possible Additional Hint - IO_ADVISE4_RECENTLY_USED
IO_ADVISE4_RECENTLY_USED The client has recently accessed the byte
range in its own cache. This informs the server that the data in
the byte range remains important to the client. When the server
reaches resource exhaustion, knowing which data is more important
allows the server to make better choices about which data to, for
example, purge from a cache or move to secondary storage. It also
informs the server which delegations are more important, since if
delegations are working correctly, once delegated to a client, a
server might never receive another I/O request for the file.
A use case for this hint is that of the NFS client or application
restart. In the event of restart, the app's/client's cache will be
cold and it will need to fill it from the server. If the server is
maintaining a list (LRU most likely) of byte ranges tagged with
IO_ADVISE4_RECENTLY_USED, then the server could have stored the data
in these ranges into a storage medium that is less expensive than
DRAM, and faster than random access magnetic or optical media, such
as flash. This allows the application and the storage system
to cooperate end to end to meet a service level agreement/objective
contracted to the end user by the IT provider.
On the other side, this is effectively a hint regarding multi-level
caching, and it may be more useful to specify a more formal multi-
level caching system. In addition, the action to be taken by the
server file system with this hint, and hence its usefulness, is
unclear. For example, as most clients already cache data that they
know is important, having this data cached twice may be unnecessary.
In fact, substantial performance improvements have been demonstrated
by making caches more exclusive of each other [25], not the
other way around. This means that there is a strong argument to be
made that servers should immediately purge the described cached data
upon receiving this hint. Other work showed that even infinite sized
secondary caches can be largely ineffective [26], but this of course
is subject to the workload.
11.9. Changes to Operation 51: LAYOUTRETURN
11.9.1. Introduction
In the pNFS description provided in [2], the client is not enabled to In the pNFS description provided in [2], the client is not enabled to
relay an error code from the DS to the MDS. In the specification of relay an error code from the DS to the MDS. In the specification of
the Objects-Based Layout protocol [7], use is made of the opaque the Objects-Based Layout protocol [8], use is made of the opaque
lrf_body field of the LAYOUTRETURN argument to do such a relaying of lrf_body field of the LAYOUTRETURN argument to do such a relaying of
error codes. In this section, we define a new data structure to error codes. In this section, we define a new data structure to
enable the passing of error codes back to the MDS and provide some enable the passing of error codes back to the MDS and provide some
guidelines on what both the client and MDS should expect in such guidelines on what both the client and MDS should expect in such
circumstances. circumstances.
There are two broad classes of errors, transient and persistent. The There are two broad classes of errors, transient and persistent. The
client SHOULD strive to only use this new mechanism to report client SHOULD strive to only use this new mechanism to report
persistent errors. It MUST be able to deal with transient issues by persistent errors. It MUST be able to deal with transient issues by
itself. Also, while the client might consider an issue to be itself. Also, while the client might consider an issue to be
skipping to change at page 75, line 34 skipping to change at page 84, line 25
hard error. The MDS on the other hand, is waiting for the client to hard error. The MDS on the other hand, is waiting for the client to
report such an error. For it, the mission is accomplished in that report such an error. For it, the mission is accomplished in that
the client has returned a layout that the MDS had most likely the client has returned a layout that the MDS had most likely
recalled. recalled.
The existing LAYOUTRETURN operation is extended by introducing a new The existing LAYOUTRETURN operation is extended by introducing a new
data structure to report errors, layoutreturn_device_error4. Also, data structure to report errors, layoutreturn_device_error4. Also,
layoutreturn_error_report4 is introduced to enable an array of errors layoutreturn_error_report4 is introduced to enable an array of errors
to be reported. to be reported.
10.8.2. ARGUMENT 11.9.2. ARGUMENT
The ARGUMENT specification of the LAYOUTRETURN operation in section The ARGUMENT specification of the LAYOUTRETURN operation in section
18.44.1 of [2] is augmented by the following XDR code [22]: 18.44.1 of [2] is augmented by the following XDR code [24]:
struct layoutreturn_device_error4 { struct layoutreturn_device_error4 {
deviceid4 lrde_deviceid; deviceid4 lrde_deviceid;
nfsstat4 lrde_status; nfsstat4 lrde_status;
nfs_opnum4 lrde_opnum; nfs_opnum4 lrde_opnum;
}; };
struct layoutreturn_error_report4 { struct layoutreturn_error_report4 {
layoutreturn_device_error4 lrer_errors<>; layoutreturn_device_error4 lrer_errors<>;
}; };
10.8.3. RESULT 11.9.3. RESULT
The RESULT of the LAYOUTRETURN operation is unchanged; see section The RESULT of the LAYOUTRETURN operation is unchanged; see section
18.44.2 of [2]. 18.44.2 of [2].
10.8.4. DESCRIPTION 11.9.4. DESCRIPTION
The following text is added to the end of the LAYOUTRETURN operation The following text is added to the end of the LAYOUTRETURN operation
DESCRIPTION in section 18.44.3 of [2]. DESCRIPTION in section 18.44.3 of [2].
When a client used LAYOUTRETURN with a type of LAYOUTRETURN4_FILE, When a client used LAYOUTRETURN with a type of LAYOUTRETURN4_FILE,
then if the lrf_body field is NULL, it indicates to the MDS that the then if the lrf_body field is NULL, it indicates to the MDS that the
client experienced no errors. If lrf_body is non-NULL, then the client experienced no errors. If lrf_body is non-NULL, then the
field references error information which is layout type specific. field references error information which is layout type specific.
I.e., the Objects-Based Layout protocol can continue to utilize I.e., the Objects-Based Layout protocol can continue to utilize
lrf_body as specified in [7]. For both Files-Based Layouts, the lrf_body as specified in [8]. For both Files-Based Layouts, the
field references a layoutreturn_error_report4, which contains an field references a layoutreturn_error_report4, which contains an
array of layoutreturn_device_error4. array of layoutreturn_device_error4.
Each individual layoutreturn_device_error4 describes a single error Each individual layoutreturn_device_error4 describes a single error
associated with a DS, which is identified via lrde_deviceid. The associated with a DS, which is identified via lrde_deviceid. The
operation which returned the error is identified via lrde_opnum. operation which returned the error is identified via lrde_opnum.
Finally the NFS error value (nfsstat4) encountered is provided via Finally the NFS error value (nfsstat4) encountered is provided via
lrde_status and may consist of the following error codes: lrde_status and may consist of the following error codes:
NFS4_OK: No issues were found for this device. NFS4_OK: No issues were found for this device.
NFS4ERR_NXIO: The client was unable to establish any communication NFS4ERR_NXIO: The client was unable to establish any communication
with the DS. with the DS.
NFS4ERR_*: The client was able to establish communication with the NFS4ERR_*: The client was able to establish communication with the
DS and is returning one of the allowed error codes for the DS and is returning one of the allowed error codes for the
operation denoted by lrde_opnum. operation denoted by lrde_opnum.
10.8.5. IMPLEMENTATION 11.9.5. IMPLEMENTATION
The following text is added to the end of the LAYOUTRETURN operation The following text is added to the end of the LAYOUTRETURN operation
IMPLEMENTATION in section 18.44.4 of [2]. IMPLEMENTATION in section 18.44.4 of [2].
A client that expects to use pNFS for a mounted filesystem SHOULD A client that expects to use pNFS for a mounted filesystem SHOULD
check for pNFS support at mount time. This check SHOULD be performed check for pNFS support at mount time. This check SHOULD be performed
by sending a GETDEVICELIST operation, followed by layout-type- by sending a GETDEVICELIST operation, followed by layout-type-
specific checks for accessibility of each storage device returned by specific checks for accessibility of each storage device returned by
GETDEVICELIST. If the NFS server does not support pNFS, the GETDEVICELIST. If the NFS server does not support pNFS, the
GETDEVICELIST operation will be rejected with an NFS4ERR_NOTSUPP GETDEVICELIST operation will be rejected with an NFS4ERR_NOTSUPP
skipping to change at page 77, line 22 skipping to change at page 86, line 9
When an I/O fails to a storage device, the client SHOULD retry the When an I/O fails to a storage device, the client SHOULD retry the
failed I/O via the MDS. In this situation, before retrying the I/O, failed I/O via the MDS. In this situation, before retrying the I/O,
the client SHOULD return the layout, or the affected portion thereof, the client SHOULD return the layout, or the affected portion thereof,
and SHOULD indicate which storage device or devices was problematic. and SHOULD indicate which storage device or devices was problematic.
If the client does not do this, the MDS may issue a layout recall If the client does not do this, the MDS may issue a layout recall
callback in order to perform the retried I/O. callback in order to perform the retried I/O.
The client needs to be cognizant that since this error handling is The client needs to be cognizant that since this error handling is
optional in the MDS, the MDS may silently ignore this functionality. optional in the MDS, the MDS may silently ignore this functionality.
Also, as the MDS may consider some issues the client reports to be Also, as the MDS may consider some issues the client reports to be
expected (see Section 10.8.1), the client might find it difficult to expected (see Section 11.9.1), the client might find it difficult to
detect an MDS which has not implemented error handling via detect an MDS which has not implemented error handling via
LAYOUTRETURN. LAYOUTRETURN.
If an MDS is aware that a storage device is proving problematic to a If an MDS is aware that a storage device is proving problematic to a
client, the MDS SHOULD NOT include that storage device in any pNFS client, the MDS SHOULD NOT include that storage device in any pNFS
layouts sent to that client. If the MDS is aware that a storage layouts sent to that client. If the MDS is aware that a storage
device is affecting many clients, then the MDS SHOULD NOT include device is affecting many clients, then the MDS SHOULD NOT include
that storage device in any pNFS layouts sent out. Clients must still that storage device in any pNFS layouts sent out. Clients must still
be aware that the MDS might not have any choice in using the storage be aware that the MDS might not have any choice in using the storage
device, i.e., there might only be one possible layout for the system. device, i.e., there might only be one possible layout for the system.
skipping to change at page 78, line 5 skipping to change at page 86, line 38
using the problematic storage devices in layouts for that client, but using the problematic storage devices in layouts for that client, but
the MDS is not required to indefinitely retain per-client storage the MDS is not required to indefinitely retain per-client storage
device error information. An MDS is also not required to device error information. An MDS is also not required to
automatically reinstate use of a previously problematic storage automatically reinstate use of a previously problematic storage
device; administrative intervention may be required instead. device; administrative intervention may be required instead.
A client MAY perform I/O via the MDS even when the client holds a A client MAY perform I/O via the MDS even when the client holds a
layout that covers the I/O; servers MUST support this client layout that covers the I/O; servers MUST support this client
behavior, and MAY recall layouts as needed to complete I/Os. behavior, and MAY recall layouts as needed to complete I/Os.
10.9. Operation 65: READ_PLUS 11.10. Operation 65: READ_PLUS
If the client sends a READ operation, it is explicitly stating that If the client sends a READ operation, it is explicitly stating that
it is not supporting sparse files. So if a READ occurs on a sparse it is not supporting sparse files. So if a READ occurs on a sparse
ADB, then the server must expand such ADBs to be raw bytes. If a ADB, then the server must expand such ADBs to be raw bytes. If a
READ occurs in the middle of an ADB, the server can only send back READ occurs in the middle of an ADB, the server can only send back
bytes starting from that offset. bytes starting from that offset.
Such an operation is inefficient for transfer of sparse sections of Such an operation is inefficient for transfer of sparse sections of
the file. As such, READ is marked as OBSOLETE in NFSv4.2. Instead, the file. As such, READ is marked as OBSOLETE in NFSv4.2. Instead,
a client should issue READ_PLUS. Note that as the client has no a a client should issue READ_PLUS. Note that as the client has no a
priori knowledge of whether an ADB is present or not, it should priori knowledge of whether an ADB is present or not, it should
always use READ_PLUS. always use READ_PLUS.
10.9.1. ARGUMENT 11.10.1. ARGUMENT
struct READ_PLUS4args { struct READ_PLUS4args {
/* CURRENT_FH: file */ /* CURRENT_FH: file */
stateid4 rpa_stateid; stateid4 rpa_stateid;
offset4 rpa_offset; offset4 rpa_offset;
count4 rpa_count; count4 rpa_count;
}; };
10.9.2. RESULT 11.10.2. RESULT
union read_plus_content switch (data_content4 content) { union read_plus_content switch (data_content4 content) {
case NFS4_CONTENT_DATA: case NFS4_CONTENT_DATA:
opaque rpc_data<>; opaque rpc_data<>;
case NFS4_CONTENT_APP_BLOCK: case NFS4_CONTENT_APP_BLOCK:
app_data_block4 rpc_block; app_data_block4 rpc_block;
case NFS4_CONTENT_HOLE: case NFS4_CONTENT_HOLE:
hole_info4 rpc_hole; data_info4 rpc_hole;
default: default:
void; void;
}; };
/* /*
* Allow a return of an array of contents. * Allow a return of an array of contents.
*/ */
struct read_plus_res4 { struct read_plus_res4 {
bool rpr_eof; bool rpr_eof;
read_plus_content rpr_contents<>; read_plus_content rpr_contents<>;
}; };
union READ_PLUS4res switch (nfsstat4 status) { union READ_PLUS4res switch (nfsstat4 status) {
case NFS4_OK: case NFS4_OK:
read_plus_res4 resok4; read_plus_res4 resok4;
default: default:
void; void;
}; };
10.9.3. DESCRIPTION 11.10.3. DESCRIPTION
Over the given range, READ_PLUS will return all data and ADBs found Over the given range, READ_PLUS will return all data and ADBs found
as an array of read_plus_content. It is possible to have consecutive as an array of read_plus_content. It is possible to have consecutive
ADBs in the array as either different definitions of ADBs are present ADBs in the array as either different definitions of ADBs are present
or as the guard pattern changes. or as the guard pattern changes.
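As a rough illustration of how a client might consume the rpr_contents array, the sketch below flattens a list of decoded entries back into raw bytes. It is not taken from the draft: the tuple representation, the name flatten_contents, and the use of a bare length for a hole are assumptions made for the example, since the fields of hole_info4 / data_info4 are not shown in this section. ADB entries are deliberately not expanded here because the layout of app_data_block4 is defined elsewhere.

```python
def flatten_contents(contents):
    """Rebuild raw bytes from a list of (kind, payload) entries.

    Assumed (hypothetical) representation of read_plus_content:
      ("data", bytes)  stands in for NFS4_CONTENT_DATA / rpc_data
      ("hole", int)    stands in for NFS4_CONTENT_HOLE, payload = hole length
    """
    out = bytearray()
    for kind, payload in contents:
        if kind == "data":
            out += payload
        elif kind == "hole":
            # A hole in a sparse file reads back as zero bytes.
            out += bytes(payload)
        else:
            # NFS4_CONTENT_APP_BLOCK would need the ADB definition to expand.
            raise ValueError("unsupported content type: %r" % (kind,))
    return bytes(out)
```

A plain READ effectively forces the server to do this expansion itself before sending, which is the inefficiency READ_PLUS is meant to avoid.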
Edge cases exist for ADBs which either begin before the rpa_offset Edge cases exist for ADBs which either begin before the rpa_offset
requested by the READ_PLUS or end after the rpa_count requested - requested by the READ_PLUS or end after the rpa_count requested -
both of which may occur as not all applications which access the file both of which may occur as not all applications which access the file
are aware of the main application imposing a format on the file are aware of the main application imposing a format on the file
contents, e.g., tar, dd, cp, etc. READ_PLUS MUST retrieve whole contents, e.g., tar, dd, cp, etc. READ_PLUS MUST retrieve whole
ADBs, but it need not retrieve an entire sequence of ADBs. ADBs, but it need not retrieve an entire sequence of ADBs.
The server MUST return a whole ADB because if it does not, it must The server MUST return a whole ADB because if it does not, it must
expand that partial ADB before it sends it to the client. E.g., if expand that partial ADB before it sends it to the client. E.g., if
an ADB had a block size of 64k and the READ_PLUS was for 128k an ADB had a block size of 64k and the READ_PLUS was for 128k
starting at an offset of 32k inside the ADB, then the first 32k would starting at an offset of 32k inside the ADB, then the first 32k would
be converted to data. be converted to data.
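The arithmetic in the example above can be sketched as follows. This is only an illustration under stated assumptions: ADB blocks are assumed to be aligned at multiples of the block size starting from offset 0, a trailing partial block is assumed to be expanded to data in the same way as a leading one, and split_read_range is a hypothetical helper name, not part of the protocol.

```python
def split_read_range(offset, count, block_size):
    """Split [offset, offset + count) into ("data" | "adb", start, length).

    Assumption: ADB blocks start at multiples of block_size.
    Blocks fully covered by the range are reported as whole ADBs;
    partially covered blocks are reported as expanded data.
    """
    end = offset + count
    segments = []
    pos = offset
    while pos < end:
        block_start = (pos // block_size) * block_size
        block_end = block_start + block_size
        if pos == block_start and block_end <= end:
            # The whole block lies inside the request: return it as an ADB.
            segments.append(("adb", pos, block_size))
            pos = block_end
        else:
            # Partial block: the server would expand this part to raw data.
            seg_end = min(block_end, end)
            segments.append(("data", pos, seg_end - pos))
            pos = seg_end
    return segments
```

For the 64k block and a 128k READ_PLUS at offset 32k, this yields 32k of data, one whole 64k ADB, and (under the trailing-block assumption) a final 32k of data.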
11. NFSv4.2 Callback Operations 11.11. Operation 66: SEEK
11.1. Procedure 16: CB_ATTR_CHANGED - Notify Client that the File's XXX
11.11.1. ARGUMENT
struct SEEK4args {
/* CURRENT_FH: file */
stateid4 sa_stateid;
offset4 sa_offset;
count4 sa_count;
};
11.11.2. RESULT
union seek_content switch (data_content4 content) {
case NFS4_CONTENT_DATA:
data_info4 sc_data;
case NFS4_CONTENT_APP_BLOCK:
app_data_block4 sc_block;
case NFS4_CONTENT_HOLE:
data_info4 sc_hole;
default:
void;
};
/*
* Allow a return of an array of contents.
*/
struct seek_res4 {
bool sr_eof;
seek_content sr_contents;
};
union SEEK4res switch (nfsstat4 status) {
case NFS4_OK:
seek_res4 resok4;
default:
void;
};
11.11.3. DESCRIPTION
Over the given range, SEEK will return a range for all data, holes,
and ADBs found as an array of seek_content. It does not return
actual data.
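A minimal sketch of the kind of range classification SEEK describes is given below. It assumes the server knows the file's allocated data extents as a sorted, non-overlapping list of (start, length) pairs; that representation, the helper name seek_ranges, and the omission of ADB and end-of-file handling are all assumptions made for illustration.

```python
def seek_ranges(data_extents, offset, count):
    """Classify [offset, offset + count) into data and hole ranges.

    data_extents: sorted, non-overlapping (start, length) pairs where
    data is allocated (a hypothetical server-side view of the file).
    Returns a list of ("data" | "hole", start, length) tuples; no file
    contents are returned, only the ranges.
    """
    end = offset + count
    out = []
    pos = offset
    for start, length in data_extents:
        ext_end = start + length
        if ext_end <= pos:
            continue            # extent lies entirely before the cursor
        if start >= end:
            break               # extent lies entirely after the request
        if start > pos:
            out.append(("hole", pos, start - pos))
            pos = start
        seg_end = min(ext_end, end)
        out.append(("data", pos, seg_end - pos))
        pos = seg_end
    if pos < end:
        out.append(("hole", pos, end - pos))
    return out
```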
12. NFSv4.2 Callback Operations
12.1. Procedure 16: CB_ATTR_CHANGED - Notify Client that the File's
Attributes Changed Attributes Changed
11.1.1. ARGUMENTS 12.1.1. ARGUMENTS
struct CB_ATTR_CHANGED4args { struct CB_ATTR_CHANGED4args {
nfs_fh4 acca_fh; nfs_fh4 acca_fh;
bitmap4 acca_critical; bitmap4 acca_critical;
bitmap4 acca_info; bitmap4 acca_info;
}; };
11.1.2. RESULTS 12.1.2. RESULTS
struct CB_ATTR_CHANGED4res { struct CB_ATTR_CHANGED4res {
nfsstat4 accr_status; nfsstat4 accr_status;
}; };
11.1.3. DESCRIPTION 12.1.3. DESCRIPTION
The CB_ATTR_CHANGED callback operation is used by the server to The CB_ATTR_CHANGED callback operation is used by the server to
indicate to the client that the file's attributes have been modified indicate to the client that the file's attributes have been modified
on the server. The server does not convey how the attributes have on the server. The server does not convey how the attributes have
changed, just that they have been modified. The server can inform changed, just that they have been modified. The server can inform
the client about both critical and informational attribute changes in the client about both critical and informational attribute changes in
the bitmask arguments. The client SHOULD query the server about all the bitmask arguments. The client SHOULD query the server about all
attributes set in acca_critical. For all changes reflected in attributes set in acca_critical. For all changes reflected in
acca_info, the client can decide whether or not it wants to poll the acca_info, the client can decide whether or not it wants to poll the
server. server.
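The split between critical and informational attributes could be handled on the client roughly as sketched below. The bitmap4 decoding follows the usual NFSv4 convention (attribute n is bit n mod 32 of word n / 32); the helper names and the want_info policy set are hypothetical and only illustrate that critical attributes are always re-queried while informational ones are left to client policy.

```python
def attrs_in_bitmap(bitmap):
    """Return the attribute numbers set in a bitmap4 (list of 32-bit words)."""
    return [
        word_index * 32 + bit
        for word_index, word in enumerate(bitmap)
        for bit in range(32)
        if word & (1 << bit)
    ]


def plan_refresh(acca_critical, acca_info, want_info=frozenset()):
    """Pick the attributes a client would re-fetch after CB_ATTR_CHANGED.

    Everything in acca_critical is queried; entries in acca_info are
    queried only if the (hypothetical) client policy in want_info asks
    for them.
    """
    critical = set(attrs_in_bitmap(acca_critical))
    optional = {a for a in attrs_in_bitmap(acca_info) if a in want_info}
    return sorted(critical | optional)
```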
The CB_ATTR_CHANGED callback operation with the FATTR4_SEC_LABEL set The CB_ATTR_CHANGED callback operation with the FATTR4_SEC_LABEL set
in acca_critical is the method used by the server to indicate that in acca_critical is the method used by the server to indicate that
the MAC label for the file referenced by acca_fh has changed. In the MAC label for the file referenced by acca_fh has changed. In
many ways, the server does not care about the result returned by the many ways, the server does not care about the result returned by the
client. client.
11.2. Operation 15: CB_COPY - Report results of a server-side copy 12.2. Operation 15: CB_COPY - Report results of a server-side copy
11.2.1. ARGUMENT
12.2.1. ARGUMENT
union copy_info4 switch (nfsstat4 cca_status) { union copy_info4 switch (nfsstat4 cca_status) {
case NFS4_OK: case NFS4_OK:
void; void;
default: default:
length4 cca_bytes_copied; length4 cca_bytes_copied;
}; };
struct CB_COPY4args { struct CB_COPY4args {
nfs_fh4 cca_fh; nfs_fh4 cca_fh;
stateid4 cca_stateid; stateid4 cca_stateid;
copy_info4 cca_copy_info; copy_info4 cca_copy_info;
}; };
11.2.2. RESULT 12.2.2. RESULT
struct CB_COPY4res { struct CB_COPY4res {
nfsstat4 ccr_status; nfsstat4 ccr_status;
}; };
11.2.3. DESCRIPTION 12.2.3. DESCRIPTION
CB_COPY is used for both intra- and inter-server asynchronous copies. CB_COPY is used for both intra- and inter-server asynchronous copies.
The CB_COPY callback informs the client of the result of an The CB_COPY callback informs the client of the result of an
asynchronous server-side copy. This operation is sent by the asynchronous server-side copy. This operation is sent by the
destination server to the client in a CB_COMPOUND request. The copy destination server to the client in a CB_COMPOUND request. The copy
is identified by the filehandle and stateid arguments. The result is is identified by the filehandle and stateid arguments. The result is
indicated by the status field. If the copy failed, cca_bytes_copied indicated by the status field. If the copy failed, cca_bytes_copied
contains the number of bytes copied before the failure occurred. The contains the number of bytes copied before the failure occurred. The
cca_bytes_copied value indicates the number of bytes copied but not cca_bytes_copied value indicates the number of bytes copied but not
which specific bytes have been copied. which specific bytes have been copied.
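One way a client might match CB_COPY callbacks to outstanding asynchronous copies is sketched below. The CopyTracker class, its method names, and the choice to key pending copies by the (filehandle, stateid) pair are assumptions for illustration; the draft only states that the copy is identified by those two arguments and that cca_bytes_copied is present when the status is not NFS4_OK.

```python
NFS4_OK = 0


class CopyTracker:
    """Track asynchronous copies until their CB_COPY callback arrives."""

    def __init__(self):
        self.pending = {}    # (fh, stateid) -> caller-supplied label
        self.finished = []   # (label, "ok" | "failed", bytes_copied or None)

    def start(self, fh, stateid, label):
        self.pending[(fh, stateid)] = label

    def on_cb_copy(self, cca_fh, cca_stateid, cca_status, cca_bytes_copied=None):
        """Record the result; return False if the copy is not known."""
        label = self.pending.pop((cca_fh, cca_stateid), None)
        if label is None:
            return False
        if cca_status == NFS4_OK:
            # copy_info4 carries no payload on success.
            self.finished.append((label, "ok", None))
        else:
            # On failure only a byte count is reported, not which bytes.
            self.finished.append((label, "failed", cca_bytes_copied))
        return True
```

Because a failed copy reports only how many bytes were written, a client that wants to resume would still need its own way to determine which ranges are valid.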
skipping to change at page 82, line 8 skipping to change at page 91, line 41
If the client supports the COPY operation, the client is REQUIRED to If the client supports the COPY operation, the client is REQUIRED to
support the CB_COPY operation. support the CB_COPY operation.
The CB_COPY operation may fail for the following reasons (this is a The CB_COPY operation may fail for the following reasons (this is a
partial list): partial list):
NFS4ERR_NOTSUPP: The copy offload operation is not supported by the NFS4ERR_NOTSUPP: The copy offload operation is not supported by the
NFS client receiving this request. NFS client receiving this request.
12. IANA Considerations 13. IANA Considerations
This section uses terms that are defined in [23]. This section uses terms that are defined in [27].
13. References 14. References
13.1. Normative References 14.1. Normative References
[1] Bradner, S., "Key words for use in RFCs to Indicate Requirement [1] Bradner, S., "Key words for use in RFCs to Indicate Requirement
Levels", March 1997. Levels", March 1997.
[2] Shepler, S., Eisler, M., and D. Noveck, "Network File System [2] Shepler, S., Eisler, M., and D. Noveck, "Network File System
(NFS) Version 4 Minor Version 1 Protocol", RFC 5661, (NFS) Version 4 Minor Version 1 Protocol", RFC 5661,
January 2010. January 2010.
[3] Haynes, T., "Network File System (NFS) Version 4 Minor Version [3] Haynes, T., "Network File System (NFS) Version 4 Minor Version
2 External Data Representation Standard (XDR) Description", 2 External Data Representation Standard (XDR) Description",
March 2011. March 2011.
[4] Berners-Lee, T., Fielding, R., and L. Masinter, "Uniform [4] Berners-Lee, T., Fielding, R., and L. Masinter, "Uniform
Resource Identifier (URI): Generic Syntax", STD 66, RFC 3986, Resource Identifier (URI): Generic Syntax", STD 66, RFC 3986,
January 2005. January 2005.
[5] Haynes, T. and N. Williams, "Remote Procedure Call (RPC) [5] Haynes, T. and N. Williams, "Remote Procedure Call (RPC)
Security Version 3", draft-williams-rpcsecgssv3 (work in Security Version 3", draft-williams-rpcsecgssv3 (work in
progress), 2011. progress), 2011.
[6] Eisler, M., Chiu, A., and L. Ling, "RPCSEC_GSS Protocol [6] The Open Group, "Section 'posix_fadvise()' of System Interfaces
of The Open Group Base Specifications Issue 6, IEEE Std 1003.1,
2004 Edition", 2004.
[7] Eisler, M., Chiu, A., and L. Ling, "RPCSEC_GSS Protocol
Specification", RFC 2203, September 1997. Specification", RFC 2203, September 1997.
[7] Halevy, B., Welch, B., and J. Zelenka, "Object-Based Parallel [8] Halevy, B., Welch, B., and J. Zelenka, "Object-Based Parallel
NFS (pNFS) Operations", RFC 5664, January 2010. NFS (pNFS) Operations", RFC 5664, January 2010.
[8] Shepler, S., Eisler, M., and D. Noveck, "Network File System [9] Shepler, S., Eisler, M., and D. Noveck, "Network File System
(NFS) Version 4 Minor Version 1 External Data Representation (NFS) Version 4 Minor Version 1 External Data Representation
Standard (XDR) Description", RFC 5662, January 2010. Standard (XDR) Description", RFC 5662, January 2010.
[9] Black, D., Glasgow, J., and S. Fridella, "Parallel NFS (pNFS) [10] Black, D., Glasgow, J., and S. Fridella, "Parallel NFS (pNFS)
Block/Volume Layout", RFC 5663, January 2010. Block/Volume Layout", RFC 5663, January 2010.
13.2. Informative References 14.2. Informative References
[10] Haynes, T. and D. Noveck, "Network File System (NFS) version 4 [11] Haynes, T. and D. Noveck, "Network File System (NFS) version 4
Protocol", draft-ietf-nfsv4-rfc3530bis-09 (Work In Progress), Protocol", draft-ietf-nfsv4-rfc3530bis-09 (Work In Progress),
March 2011. March 2011.
[11] Lentini, J., Everhart, C., Ellard, D., Tewari, R., and M. Naik, [12] Lentini, J., Everhart, C., Ellard, D., Tewari, R., and M. Naik,
"NSDB Protocol for Federated Filesystems", "NSDB Protocol for Federated Filesystems",
draft-ietf-nfsv4-federated-fs-protocol (Work In Progress), draft-ietf-nfsv4-federated-fs-protocol (Work In Progress),
2010. 2010.
[12] Lentini, J., Everhart, C., Ellard, D., Tewari, R., and M. Naik, [13] Lentini, J., Everhart, C., Ellard, D., Tewari, R., and M. Naik,
"Administration Protocol for Federated Filesystems", "Administration Protocol for Federated Filesystems",
draft-ietf-nfsv4-federated-fs-admin (Work In Progress), 2010. draft-ietf-nfsv4-federated-fs-admin (Work In Progress), 2010.
[13] Fielding, R., Gettys, J., Mogul, J., Frystyk, H., Masinter, L., [14] Fielding, R., Gettys, J., Mogul, J., Frystyk, H., Masinter, L.,
Leach, P., and T. Berners-Lee, "Hypertext Transfer Protocol -- Leach, P., and T. Berners-Lee, "Hypertext Transfer Protocol --
HTTP/1.1", RFC 2616, June 1999. HTTP/1.1", RFC 2616, June 1999.
[14] Postel, J. and J. Reynolds, "File Transfer Protocol", STD 9, [15] Postel, J. and J. Reynolds, "File Transfer Protocol", STD 9,
RFC 959, October 1985. RFC 959, October 1985.
[15] Simpson, W., "PPP Challenge Handshake Authentication Protocol [16] Simpson, W., "PPP Challenge Handshake Authentication Protocol
(CHAP)", RFC 1994, August 1996. (CHAP)", RFC 1994, August 1996.
[16] Strohm, R., "Chapter 2, Data Blocks, Extents, and Segments, of [17] VanDeBogart, S., Frost, C., and E. Kohler, "Reducing Seek
Overhead with Application-Directed Prefetching", Proceedings of
USENIX Annual Technical Conference , June 2009.
[18] Strohm, R., "Chapter 2, Data Blocks, Extents, and Segments, of
Oracle Database Concepts 11g Release 1 (11.1)", January 2011. Oracle Database Concepts 11g Release 1 (11.1)", January 2011.
[17] Ashdown, L., "Chapter 15, Validating Database Files and [19] Ashdown, L., "Chapter 15, Validating Database Files and
Backups, of Oracle Database Backup and Recovery User's Guide Backups, of Oracle Database Backup and Recovery User's Guide
11g Release 1 (11.1)", August 2008. 11g Release 1 (11.1)", August 2008.
[18] McDougall, R. and J. Mauro, "Section 11.4.3, Detecting Memory [20] McDougall, R. and J. Mauro, "Section 11.4.3, Detecting Memory
Corruption of Solaris Internals", 2007. Corruption of Solaris Internals", 2007.
[19] Bairavasundaram, L., Goodson, G., Schroeder, B., Arpaci- [21] Bairavasundaram, L., Goodson, G., Schroeder, B., Arpaci-
Dusseau, A., and R. Arpaci-Dusseau, "An Analysis of Data Dusseau, A., and R. Arpaci-Dusseau, "An Analysis of Data
Corruption in the Storage Stack", Proceedings of the 6th USENIX Corruption in the Storage Stack", Proceedings of the 6th USENIX
Symposium on File and Storage Technologies (FAST '08) , 2008. Symposium on File and Storage Technologies (FAST '08) , 2008.
[20] "Section 46.6. Multi-Level Security (MLS) of Deployment Guide: [22] "Section 46.6. Multi-Level Security (MLS) of Deployment Guide:
Deployment, configuration and administration of Red Hat Deployment, configuration and administration of Red Hat
Enterprise Linux 5, Edition 6", 2011. Enterprise Linux 5, Edition 6", 2011.
[21] Quigley, D. and J. Lu, "Registry Specification for MAC Security [23] Quigley, D. and J. Lu, "Registry Specification for MAC Security
Label Formats", draft-quigley-label-format-registry (work in Label Formats", draft-quigley-label-format-registry (work in
progress), 2011. progress), 2011.
[22] Eisler, M., "XDR: External Data Representation Standard", [24] Eisler, M., "XDR: External Data Representation Standard",
RFC 4506, May 2006. RFC 4506, May 2006.
[23] Narten, T. and H. Alvestrand, "Guidelines for Writing an IANA [25] Wong, T. and J. Wilkes, "My cache or yours? Making storage more
exclusive", Proceedings of the USENIX Annual Technical
Conference , 2002.
[26] Muntz, D. and P. Honeyman, "Multi-level Caching in Distributed
File Systems", Proceedings of USENIX Annual Technical
Conference , 1992.
[27] Narten, T. and H. Alvestrand, "Guidelines for Writing an IANA
Considerations Section in RFCs", BCP 26, RFC 5226, May 2008. Considerations Section in RFCs", BCP 26, RFC 5226, May 2008.
[24] Nowicki, B., "NFS: Network File System Protocol specification", [28] Nowicki, B., "NFS: Network File System Protocol specification",
RFC 1094, March 1989. RFC 1094, March 1989.
[25] Callaghan, B., Pawlowski, B., and P. Staubach, "NFS Version 3 [29] Callaghan, B., Pawlowski, B., and P. Staubach, "NFS Version 3
Protocol Specification", RFC 1813, June 1995. Protocol Specification", RFC 1813, June 1995.
[26] Srinivasan, R., "Binding Protocols for ONC RPC Version 2", [30] Srinivasan, R., "Binding Protocols for ONC RPC Version 2",
RFC 1833, August 1995. RFC 1833, August 1995.
[27] Eisler, M., "NFS Version 2 and Version 3 Security Issues and [31] Eisler, M., "NFS Version 2 and Version 3 Security Issues and
the NFS Protocol's Use of RPCSEC_GSS and Kerberos V5", the NFS Protocol's Use of RPCSEC_GSS and Kerberos V5",
RFC 2623, June 1999. RFC 2623, June 1999.
[28] Callaghan, B., "NFS URL Scheme", RFC 2224, October 1997. [32] Callaghan, B., "NFS URL Scheme", RFC 2224, October 1997.
[29] Shepler, S., "NFS Version 4 Design Considerations", RFC 2624, [33] Shepler, S., "NFS Version 4 Design Considerations", RFC 2624,
June 1999. June 1999.
[30] Reynolds, J., "Assigned Numbers: RFC 1700 is Replaced by an On- [34] Reynolds, J., "Assigned Numbers: RFC 1700 is Replaced by an On-
line Database", RFC 3232, January 2002. line Database", RFC 3232, January 2002.
[31] Linn, J., "The Kerberos Version 5 GSS-API Mechanism", RFC 1964, [35] Linn, J., "The Kerberos Version 5 GSS-API Mechanism", RFC 1964,
June 1996. June 1996.
[32] Shepler, S., Callaghan, B., Robinson, D., Thurlow, R., Beame, [36] Shepler, S., Callaghan, B., Robinson, D., Thurlow, R., Beame,
C., Eisler, M., and D. Noveck, "Network File System (NFS) C., Eisler, M., and D. Noveck, "Network File System (NFS)
version 4 Protocol", RFC 3530, April 2003. version 4 Protocol", RFC 3530, April 2003.
Appendix A. Acknowledgments Appendix A. Acknowledgments
For the pNFS Access Permissions Check, the original draft was by For the pNFS Access Permissions Check, the original draft was by
Sorin Faibish, David Black, Mike Eisler, and Jason Glasgow. The work Sorin Faibish, David Black, Mike Eisler, and Jason Glasgow. The work
was influenced by discussions with Benny Halevy and Bruce Fields. A was influenced by discussions with Benny Halevy and Bruce Fields. A
review was done by Tom Haynes. review was done by Tom Haynes.
For the Sharing change attribute implementation details with NFSv4 For the Sharing change attribute implementation details with NFSv4
clients, the original draft was by Trond Myklebust. clients, the original draft was by Trond Myklebust.
For the NFS Server-side Copy, the original draft was by James For the NFS Server-side Copy, the original draft was by James
Lentini, Mike Eisler, Deepak Kenchammana, Anshul Madan, and Rahul Lentini, Mike Eisler, Deepak Kenchammana, Anshul Madan, and Rahul
Iyer. Talpey co-authored an unpublished version of that document. Iyer. Tom Talpey co-authored an unpublished version of that
document. It was also reviewed by a number of individuals:
It was also reviewed by a number of individuals: Pranoop Erasani, Pranoop Erasani, Tom Haynes, Arthur Lent, Trond Myklebust, Dave
Tom Haynes, Arthur Lent, Trond Myklebust, Dave Noveck, Theresa Noveck, Theresa Lingutla-Raj, Manjunath Shankararao, Satyam Vaghani,
Lingutla-Raj, Manjunath Shankararao, Satyam Vaghani, and Nico and Nico Williams.
Williams.
For the NFS space reservation operations, the original draft was by For the NFS space reservation operations, the original draft was by
Mike Eisler, James Lentini, Manjunath Shankararao, and Rahul Iyer. Mike Eisler, James Lentini, Manjunath Shankararao, and Rahul Iyer.
For the sparse file support, the original draft was by Dean For the sparse file support, the original draft was by Dean
Hildebrand and Marc Eshel. Valuable input and advice was received Hildebrand and Marc Eshel. Valuable input and advice was received
from Sorin Faibish, Bruce Fields, Benny Halevy, Trond Myklebust, and from Sorin Faibish, Bruce Fields, Benny Halevy, Trond Myklebust, and
Richard Scheffenegger. Richard Scheffenegger.
For the Application IO Hints, the original draft was by Dean
Hildebrand, Mike Eisler, Trond Myklebust, and Sam Falkner. Some
early reviewers included Benny Halevy and Pranoop Erasani.
For Labeled NFS, the original draft was by David Quigley, James For Labeled NFS, the original draft was by David Quigley, James
Morris, Jarret Lu, and Tom Haynes. Peter Staubach, Trond Myklebust, Morris, Jarret Lu, and Tom Haynes. Peter Staubach, Trond Myklebust,
Sorin Faibish, Nico Williams, and David Black also contributed in Sorin Faibish, Nico Williams, and David Black also contributed in
the final push to get this accepted. the final push to get this accepted.
Appendix B. RFC Editor Notes Appendix B. RFC Editor Notes
[RFC Editor: please remove this section prior to publishing this [RFC Editor: please remove this section prior to publishing this
document as an RFC] document as an RFC]
 End of changes. 177 change blocks. 
267 lines changed or deleted 807 lines changed or added
