draft-ietf-nfsv4-minorversion2-06.txt   draft-ietf-nfsv4-minorversion2-07.txt 
NFSv4 T. Haynes NFSv4 T. Haynes
Internet-Draft Editor Internet-Draft Editor
Intended status: Standards Track November 14, 2011 Intended status: Standards Track January 04, 2012
Expires: May 17, 2012 Expires: July 7, 2012
NFS Version 4 Minor Version 2 NFS Version 4 Minor Version 2
draft-ietf-nfsv4-minorversion2-06.txt draft-ietf-nfsv4-minorversion2-07.txt
Abstract Abstract
This Internet-Draft describes NFS version 4 minor version two, This Internet-Draft describes NFS version 4 minor version two,
focusing mainly on the protocol extensions made from NFS version 4 focusing mainly on the protocol extensions made from NFS version 4
minor version 0 and NFS version 4 minor version 1. Major extensions minor version 0 and NFS version 4 minor version 1. Major extensions
introduced in NFS version 4 minor version two include: Server-side introduced in NFS version 4 minor version two include: Server-side
Copy, Space Reservations, and Support for Sparse Files. Copy, Space Reservations, and Support for Sparse Files.
Requirements Language Requirements Language
skipping to change at page 1, line 40 skipping to change at page 1, line 40
Internet-Drafts are working documents of the Internet Engineering Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF). Note that other groups may also distribute Task Force (IETF). Note that other groups may also distribute
working documents as Internet-Drafts. The list of current Internet- working documents as Internet-Drafts. The list of current Internet-
Drafts is at http://datatracker.ietf.org/drafts/current/. Drafts is at http://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress." material or to cite them other than as "work in progress."
This Internet-Draft will expire on May 17, 2012. This Internet-Draft will expire on July 7, 2012.
Copyright Notice Copyright Notice
Copyright (c) 2011 IETF Trust and the persons identified as the Copyright (c) 2012 IETF Trust and the persons identified as the
document authors. All rights reserved. document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents Provisions Relating to IETF Documents
(http://trustee.ietf.org/license-info) in effect on the date of (http://trustee.ietf.org/license-info) in effect on the date of
publication of this document. Please review these documents publication of this document. Please review these documents
carefully, as they describe your rights and restrictions with respect carefully, as they describe your rights and restrictions with respect
to this document. Code Components extracted from this document must to this document. Code Components extracted from this document must
include Simplified BSD License text as described in Section 4.e of include Simplified BSD License text as described in Section 4.e of
the Trust Legal Provisions and are provided without warranty as the Trust Legal Provisions and are provided without warranty as
skipping to change at page 3, line 7 skipping to change at page 3, line 7
modifications of such material outside the IETF Standards Process. modifications of such material outside the IETF Standards Process.
Without obtaining an adequate license from the person(s) controlling Without obtaining an adequate license from the person(s) controlling
the copyright in such materials, this document may not be modified the copyright in such materials, this document may not be modified
outside the IETF Standards Process, and derivative works of it may outside the IETF Standards Process, and derivative works of it may
not be created outside the IETF Standards Process, except to format not be created outside the IETF Standards Process, except to format
it for publication as an RFC or to translate it into languages other it for publication as an RFC or to translate it into languages other
than English. than English.
Table of Contents Table of Contents
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 6 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.1. The NFS Version 4 Minor Version 2 Protocol . . . . . . . 6 1.1. The NFS Version 4 Minor Version 2 Protocol . . . . . . . 5
1.2. Scope of This Document . . . . . . . . . . . . . . . . . 6 1.2. Scope of This Document . . . . . . . . . . . . . . . . . 5
1.3. NFSv4.2 Goals . . . . . . . . . . . . . . . . . . . . . . 6 1.3. NFSv4.2 Goals . . . . . . . . . . . . . . . . . . . . . . 5
1.4. Overview of NFSv4.2 Features . . . . . . . . . . . . . . 6 1.4. Overview of NFSv4.2 Features . . . . . . . . . . . . . . 5
1.4.1. Application I/O Advise . . . . . . . . . . . . . . . . 6 1.4.1. Sparse Files . . . . . . . . . . . . . . . . . . . . . 5
1.5. Differences from NFSv4.1 . . . . . . . . . . . . . . . . 7 1.4.2. Application I/O Advise . . . . . . . . . . . . . . . . 6
2. NFS Server-side Copy . . . . . . . . . . . . . . . . . . . . . 7 1.5. Differences from NFSv4.1 . . . . . . . . . . . . . . . . 6
2.1. Introduction . . . . . . . . . . . . . . . . . . . . . . 7 2. NFS Server-side Copy . . . . . . . . . . . . . . . . . . . . . 6
2.1. Introduction . . . . . . . . . . . . . . . . . . . . . . 6
2.2. Protocol Overview . . . . . . . . . . . . . . . . . . . . 7 2.2. Protocol Overview . . . . . . . . . . . . . . . . . . . . 7
2.2.1. Intra-Server Copy . . . . . . . . . . . . . . . . . . 9 2.2.1. Intra-Server Copy . . . . . . . . . . . . . . . . . . 8
2.2.2. Inter-Server Copy . . . . . . . . . . . . . . . . . . 10 2.2.2. Inter-Server Copy . . . . . . . . . . . . . . . . . . 9
2.2.3. Server-to-Server Copy Protocol . . . . . . . . . . . . 13 2.2.3. Server-to-Server Copy Protocol . . . . . . . . . . . . 12
2.3. Operations . . . . . . . . . . . . . . . . . . . . . . . 15 2.3. Operations . . . . . . . . . . . . . . . . . . . . . . . 14
2.3.1. netloc4 - Network Locations . . . . . . . . . . . . . 15 2.3.1. netloc4 - Network Locations . . . . . . . . . . . . . 14
2.3.2. Copy Offload Stateids . . . . . . . . . . . . . . . . 16 2.3.2. Copy Offload Stateids . . . . . . . . . . . . . . . . 15
2.4. Security Considerations . . . . . . . . . . . . . . . . . 16 2.4. Security Considerations . . . . . . . . . . . . . . . . . 15
2.4.1. Inter-Server Copy Security . . . . . . . . . . . . . . 16 2.4.1. Inter-Server Copy Security . . . . . . . . . . . . . . 15
3. Sparse Files . . . . . . . . . . . . . . . . . . . . . . . . . 24 3. Sparse Files . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.1. Introduction . . . . . . . . . . . . . . . . . . . . . . 24 3.1. Introduction . . . . . . . . . . . . . . . . . . . . . . 24
3.2. Terminology . . . . . . . . . . . . . . . . . . . . . . . 25 3.2. Terminology . . . . . . . . . . . . . . . . . . . . . . . 24
3.3. Overview of Sparse Files and NFSv4 . . . . . . . . . . . 25 3.3. Determining the next hole/data . . . . . . . . . . . . . 25
3.4. Operation 65: READ_PLUS . . . . . . . . . . . . . . . . . 26 4. Space Reservation . . . . . . . . . . . . . . . . . . . . . . 25
3.4.1. ARGUMENT . . . . . . . . . . . . . . . . . . . . . . . 26 4.1. Introduction . . . . . . . . . . . . . . . . . . . . . . 25
3.4.2. RESULT . . . . . . . . . . . . . . . . . . . . . . . . 27 5. Support for Application IO Hints . . . . . . . . . . . . . . . 27
3.4.3. DESCRIPTION . . . . . . . . . . . . . . . . . . . . . 27 5.1. Introduction . . . . . . . . . . . . . . . . . . . . . . 27
3.4.4. IMPLEMENTATION . . . . . . . . . . . . . . . . . . . . 29 5.2. POSIX Requirements . . . . . . . . . . . . . . . . . . . 28
3.4.5. READ_PLUS with Sparse Files Example . . . . . . . . . 30 5.3. Additional Requirements . . . . . . . . . . . . . . . . . 29
3.5. Related Work . . . . . . . . . . . . . . . . . . . . . . 31 5.4. Security Considerations . . . . . . . . . . . . . . . . . 30
3.6. Other Proposed Designs . . . . . . . . . . . . . . . . . 31 5.5. IANA Considerations . . . . . . . . . . . . . . . . . . . 30
3.6.1. Multi-Data Server Hole Information . . . . . . . . . . 31 6. Application Data Block Support . . . . . . . . . . . . . . . . 30
3.6.2. Data Result Array . . . . . . . . . . . . . . . . . . 32 6.1. Generic Framework . . . . . . . . . . . . . . . . . . . . 31
3.6.3. User-Defined Sparse Mask . . . . . . . . . . . . . . . 32 6.1.1. Data Block Representation . . . . . . . . . . . . . . 31
3.6.4. Allocated flag . . . . . . . . . . . . . . . . . . . . 32 6.1.2. Data Content . . . . . . . . . . . . . . . . . . . . . 32
3.6.5. Dense and Sparse pNFS File Layouts . . . . . . . . . . 33 6.2. pNFS Considerations . . . . . . . . . . . . . . . . . . . 32
4. Space Reservation . . . . . . . . . . . . . . . . . . . . . . 33 6.3. An Example of Detecting Corruption . . . . . . . . . . . 33
4.1. Introduction . . . . . . . . . . . . . . . . . . . . . . 33 6.4. Example of READ_PLUS . . . . . . . . . . . . . . . . . . 34
4.2. Operations and attributes . . . . . . . . . . . . . . . . 35 6.5. Zero Filled Holes . . . . . . . . . . . . . . . . . . . . 35
4.3. Attribute 77: space_reserved . . . . . . . . . . . . . . 35 7. Labeled NFS . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.4. Attribute 78: space_freed . . . . . . . . . . . . . . . . 36 7.1. Introduction . . . . . . . . . . . . . . . . . . . . . . 35
5. Support for Application IO Hints . . . . . . . . . . . . . . . 36 7.2. Definitions . . . . . . . . . . . . . . . . . . . . . . . 36
5.1. Introduction . . . . . . . . . . . . . . . . . . . . . . 36 7.3. MAC Security Attribute . . . . . . . . . . . . . . . . . 37
5.2. POSIX Requirements . . . . . . . . . . . . . . . . . . . 37 7.3.1. Interpreting FATTR4_SEC_LABEL . . . . . . . . . . . . 37
5.3. Additional Requirements . . . . . . . . . . . . . . . . . 38 7.3.2. Delegations . . . . . . . . . . . . . . . . . . . . . 38
5.4. Security Considerations . . . . . . . . . . . . . . . . . 39 7.3.3. Permission Checking . . . . . . . . . . . . . . . . . 38
5.5. IANA Considerations . . . . . . . . . . . . . . . . . . . 39 7.3.4. Object Creation . . . . . . . . . . . . . . . . . . . 39
7.3.5. Existing Objects . . . . . . . . . . . . . . . . . . . 39
6. Application Data Block Support . . . . . . . . . . . . . . . . 39 7.3.6. Label Changes . . . . . . . . . . . . . . . . . . . . 39
6.1. Generic Framework . . . . . . . . . . . . . . . . . . . . 40 7.4. pNFS Considerations . . . . . . . . . . . . . . . . . . . 40
6.1.1. Data Block Representation . . . . . . . . . . . . . . 40 7.5. Discovery of Server LNFS Support . . . . . . . . . . . . 40
6.1.2. Data Content . . . . . . . . . . . . . . . . . . . . . 41 7.6. MAC Security NFS Modes of Operation . . . . . . . . . . . 41
6.2. pNFS Considerations . . . . . . . . . . . . . . . . . . . 41 7.6.1. Full Mode . . . . . . . . . . . . . . . . . . . . . . 41
6.3. An Example of Detecting Corruption . . . . . . . . . . . 42 7.6.2. Smart Client Mode . . . . . . . . . . . . . . . . . . 42
6.4. Example of READ_PLUS . . . . . . . . . . . . . . . . . . 43 7.6.3. Smart Server Mode . . . . . . . . . . . . . . . . . . 43
6.5. Zero Filled Holes . . . . . . . . . . . . . . . . . . . . 44 7.7. Security Considerations . . . . . . . . . . . . . . . . . 44
7. Labeled NFS . . . . . . . . . . . . . . . . . . . . . . . . . 44
7.1. Introduction . . . . . . . . . . . . . . . . . . . . . . 44
7.2. Definitions . . . . . . . . . . . . . . . . . . . . . . . 45
7.3. MAC Security Attribute . . . . . . . . . . . . . . . . . 46
7.3.1. Interpreting FATTR4_SEC_LABEL . . . . . . . . . . . . 46
7.3.2. Delegations . . . . . . . . . . . . . . . . . . . . . 47
7.3.3. Permission Checking . . . . . . . . . . . . . . . . . 47
7.3.4. Object Creation . . . . . . . . . . . . . . . . . . . 48
7.3.5. Existing Objects . . . . . . . . . . . . . . . . . . . 48
7.3.6. Label Changes . . . . . . . . . . . . . . . . . . . . 48
7.4. pNFS Considerations . . . . . . . . . . . . . . . . . . . 49
7.5. Discovery of Server LNFS Support . . . . . . . . . . . . 49
7.6. MAC Security NFS Modes of Operation . . . . . . . . . . . 50
7.6.1. Full Mode . . . . . . . . . . . . . . . . . . . . . . 50
7.6.2. Smart Client Mode . . . . . . . . . . . . . . . . . . 51
7.6.3. Smart Server Mode . . . . . . . . . . . . . . . . . . 52
7.7. Security Considerations . . . . . . . . . . . . . . . . . 53
8. Sharing change attribute implementation details with NFSv4 8. Sharing change attribute implementation details with NFSv4
clients . . . . . . . . . . . . . . . . . . . . . . . . . . . 53 clients . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
8.1. Introduction . . . . . . . . . . . . . . . . . . . . . . 53 8.1. Introduction . . . . . . . . . . . . . . . . . . . . . . 44
8.2. Definition of the 'change_attr_type' per-file system 8.2. Definition of the 'change_attr_type' per-file system
attribute . . . . . . . . . . . . . . . . . . . . . . . . 54 attribute . . . . . . . . . . . . . . . . . . . . . . . . 45
9. Security Considerations . . . . . . . . . . . . . . . . . . . 55 9. Security Considerations . . . . . . . . . . . . . . . . . . . 46
10. Operations: REQUIRED, RECOMMENDED, or OPTIONAL . . . . . . . . 55 10. File Attributes . . . . . . . . . . . . . . . . . . . . . . . 46
11. NFSv4.2 Operations . . . . . . . . . . . . . . . . . . . . . . 59 10.1. Attribute Definitions . . . . . . . . . . . . . . . . . . 46
11.1. Operation 59: COPY - Initiate a server-side copy . . . . 59 11. Operations: REQUIRED, RECOMMENDED, or OPTIONAL . . . . . . . . 47
11.2. Operation 60: COPY_ABORT - Cancel a server-side copy . . 66 12. NFSv4.2 Operations . . . . . . . . . . . . . . . . . . . . . . 50
11.3. Operation 61: COPY_NOTIFY - Notify a source server of 12.1. Operation 59: COPY - Initiate a server-side copy . . . . 50
a future copy . . . . . . . . . . . . . . . . . . . . . . 67 12.2. Operation 60: COPY_ABORT - Cancel a server-side copy . . 58
11.4. Operation 62: COPY_REVOKE - Revoke a destination 12.3. Operation 61: COPY_NOTIFY - Notify a source server of
server's copy privileges . . . . . . . . . . . . . . . . 70 a future copy . . . . . . . . . . . . . . . . . . . . . . 59
11.5. Operation 63: COPY_STATUS - Poll for status of a 12.4. Operation 62: COPY_REVOKE - Revoke a destination
server-side copy . . . . . . . . . . . . . . . . . . . . 71 server's copy privileges . . . . . . . . . . . . . . . . 62
11.6. Modification to Operation 42: EXCHANGE_ID - 12.5. Operation 63: COPY_STATUS - Poll for status of a
Instantiate Client ID . . . . . . . . . . . . . . . . . . 72 server-side copy . . . . . . . . . . . . . . . . . . . . 63
11.7. Operation 64: INITIALIZE . . . . . . . . . . . . . . . . 73 12.6. Modification to Operation 42: EXCHANGE_ID -
11.8. Operation 67: IO_ADVISE - Application I/O access Instantiate Client ID . . . . . . . . . . . . . . . . . . 64
pattern hints . . . . . . . . . . . . . . . . . . . . . . 76 12.7. Operation 64: INITIALIZE . . . . . . . . . . . . . . . . 65
11.9. Changes to Operation 51: LAYOUTRETURN . . . . . . . . . . 83 12.8. Operation 67: IO_ADVISE - Application I/O access
11.9.1. Introduction . . . . . . . . . . . . . . . . . . . . . 83 pattern hints . . . . . . . . . . . . . . . . . . . . . . 69
11.9.2. ARGUMENT . . . . . . . . . . . . . . . . . . . . . . . 84 12.9. Changes to Operation 51: LAYOUTRETURN . . . . . . . . . . 75
11.9.3. RESULT . . . . . . . . . . . . . . . . . . . . . . . . 84 12.10. Operation 65: READ_PLUS . . . . . . . . . . . . . . . . . 78
11.9.4. DESCRIPTION . . . . . . . . . . . . . . . . . . . . . 84 12.11. Operation 66: SEEK . . . . . . . . . . . . . . . . . . . 84
11.9.5. IMPLEMENTATION . . . . . . . . . . . . . . . . . . . . 85 13. NFSv4.2 Callback Operations . . . . . . . . . . . . . . . . . 86
11.10. Operation 65: READ_PLUS . . . . . . . . . . . . . . . . . 86 13.1. Procedure 16: CB_ATTR_CHANGED - Notify Client that
11.11. Operation 66: SEEK . . . . . . . . . . . . . . . . . . . 88 the File's Attributes Changed . . . . . . . . . . . . . . 86
12. NFSv4.2 Callback Operations . . . . . . . . . . . . . . . . . 89 13.2. Operation 15: CB_COPY - Report results of a
12.1. Procedure 16: CB_ATTR_CHANGED - Notify Client that server-side copy . . . . . . . . . . . . . . . . . . . . 86
the File's Attributes Changed . . . . . . . . . . . . . . 89 14. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 88
12.2. Operation 15: CB_COPY - Report results of a 15. References . . . . . . . . . . . . . . . . . . . . . . . . . . 88
server-side copy . . . . . . . . . . . . . . . . . . . . 90 15.1. Normative References . . . . . . . . . . . . . . . . . . 88
13. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 91 15.2. Informative References . . . . . . . . . . . . . . . . . 89
14. References . . . . . . . . . . . . . . . . . . . . . . . . . . 91 Appendix A. Acknowledgments . . . . . . . . . . . . . . . . . . . 91
14.1. Normative References . . . . . . . . . . . . . . . . . . 91 Appendix B. RFC Editor Notes . . . . . . . . . . . . . . . . . . 91
14.2. Informative References . . . . . . . . . . . . . . . . . 92 Author's Address . . . . . . . . . . . . . . . . . . . . . . . . . 91
Appendix A. Acknowledgments . . . . . . . . . . . . . . . . . . . 94
Appendix B. RFC Editor Notes . . . . . . . . . . . . . . . . . . 95
Author's Address . . . . . . . . . . . . . . . . . . . . . . . . . 95
1. Introduction 1. Introduction
1.1. The NFS Version 4 Minor Version 2 Protocol 1.1. The NFS Version 4 Minor Version 2 Protocol
The NFS version 4 minor version 2 (NFSv4.2) protocol is the third The NFS version 4 minor version 2 (NFSv4.2) protocol is the third
minor version of the NFS version 4 (NFSv4) protocol. The first minor minor version of the NFS version 4 (NFSv4) protocol. The first minor
version, NFSv4.0, is described in [11] and the second minor version, version, NFSv4.0, is described in [11] and the second minor version,
NFSv4.1, is described in [2]. It follows the guidelines for minor NFSv4.1, is described in [2]. It follows the guidelines for minor
versioning that are listed in Section 11 of [11]. versioning that are listed in Section 11 of [11].
skipping to change at page 6, line 45 skipping to change at page 5, line 45
The full XDR for NFSv4.2 is presented in [3]. The full XDR for NFSv4.2 is presented in [3].
1.3. NFSv4.2 Goals 1.3. NFSv4.2 Goals
[[Comment.1: This needs fleshing out! --TH]] [[Comment.1: This needs fleshing out! --TH]]
1.4. Overview of NFSv4.2 Features 1.4. Overview of NFSv4.2 Features
[[Comment.2: This needs fleshing out! --TH]] [[Comment.2: This needs fleshing out! --TH]]
1.4.1. Application I/O Advise 1.4.1. Sparse Files
Two new operations are defined to support the reading of sparse files
(READ_PLUS) and the punching of holes to remove backing storage
(INITIALIZE).
1.4.2. Application I/O Advise
We propose a new IO_ADVISE operation for NFSv4.2 that clients can use We propose a new IO_ADVISE operation for NFSv4.2 that clients can use
to communicate expected I/O behavior to the server. By communicating to communicate expected I/O behavior to the server. By communicating
future I/O behavior such as whether a file will be accessed future I/O behavior such as whether a file will be accessed
sequentially or randomly, and whether a file will or will not be sequentially or randomly, and whether a file will or will not be
accessed in the near future, servers can optimize future I/O requests accessed in the near future, servers can optimize future I/O requests
for a file by, for example, prefetching or evicting data. This for a file by, for example, prefetching or evicting data. This
operation can be used to support the posix_fadvise function as well operation can be used to support the posix_fadvise function as well
as other applications such as databases and video editors. as other applications such as databases and video editors.
skipping to change at page 23, line 50 skipping to change at page 22, line 50
the challenge is how the source server and destination server the challenge is how the source server and destination server
identify themselves to each other, especially in the presence of identify themselves to each other, especially in the presence of
multi-homed source and destination servers. In a multi-homed multi-homed source and destination servers. In a multi-homed
environment, the destination server might not contact the source environment, the destination server might not contact the source
server from the same network address specified by the client in the server from the same network address specified by the client in the
COPY_NOTIFY. This can be overcome using the procedure described COPY_NOTIFY. This can be overcome using the procedure described
below. below.
When the client sends the source server the COPY_NOTIFY operation, When the client sends the source server the COPY_NOTIFY operation,
the source server may reply to the client with a list of target the source server may reply to the client with a list of target
addresses, names, and/or URLs and assign them to the unique triple: addresses, names, and/or URLs and assign them to the unique
<source fh, user ID, destination address Y>. If the destination uses quadruple: <random number, source fh, user ID, destination address
one of these target netlocs to contact the source server, the source Y>. If the destination uses one of these target netlocs to contact
server will be able to uniquely identify the destination server, even the source server, the source server will be able to uniquely
if the destination server does not connect from the address specified identify the destination server, even if the destination server does
by the client in COPY_NOTIFY. not connect from the address specified by the client in COPY_NOTIFY.
The level of assurance in this identification depends on the
unpredictability, strength and secrecy of the random number.
For example, suppose the network topology is as shown in Figure 3. For example, suppose the network topology is as shown in Figure 3.
If the source filehandle is 0x12345, the source server may respond to If the source filehandle is 0x12345, the source server may respond to
a COPY_NOTIFY for destination 10.11.78.56 with the URLs: a COPY_NOTIFY for destination 10.11.78.56 with the URLs:
nfs://10.11.78.18//_COPY/10.11.78.56/_FH/0x12345 nfs://10.11.78.18//_COPY/FvhH1OKbu8VrxvV1erdjvR7N/10.11.78.56/_FH/
0x12345
nfs://192.168.33.18//_COPY/10.11.78.56/_FH/0x12345 nfs://192.168.33.18//_COPY/FvhH1OKbu8VrxvV1erdjvR7N/10.11.78.56/
_FH/0x12345
The name component after _COPY is 24 characters of base 64, more than
enough to encode a 128 bit random number.
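
The URL construction above can be sketched in Python as follows. This is a minimal illustration, not part of the protocol: the function names, the use of the URL-safe base 64 alphabet, and the choice of 18 random bytes (144 bits, which encode to exactly 24 characters with no padding) are all assumptions made for the example.

```python
import base64
import secrets


def make_copy_token(num_bytes: int = 18) -> str:
    """Return an unpredictable token; 18 bytes encode to 24 base 64 characters."""
    raw = secrets.token_bytes(num_bytes)
    # URL-safe alphabet (an assumption) so the token can sit in a path component.
    return base64.urlsafe_b64encode(raw).decode("ascii")


def make_copy_url(source_addr: str, token: str, dest_addr: str, source_fh: str) -> str:
    """Assemble a target URL shaped like the example in the text."""
    return f"nfs://{source_addr}//_COPY/{token}/{dest_addr}/_FH/{source_fh}"


# One token is shared by every address the source server advertises.
token = make_copy_token()
urls = [
    make_copy_url(addr, token, "10.11.78.56", "0x12345")
    for addr in ("10.11.78.18", "192.168.33.18")
]
```

The token comes from the operating system's cryptographic random source through the secrets module, in keeping with the requirement that the random number be unpredictable.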
The client will then send these URLs to the destination server in the The client will then send these URLs to the destination server in the
COPY operation. Suppose that the 192.168.33.0/24 network is a high COPY operation. Suppose that the 192.168.33.0/24 network is a high
speed network and the destination server decides to transfer the file speed network and the destination server decides to transfer the file
over this network. If the destination contacts the source server over this network. If the destination contacts the source server
from 192.168.33.56 over this network using NFSv4.1, it does the from 192.168.33.56 over this network using NFSv4.1, it does the
following: following:
COMPOUND { PUTROOTFH, LOOKUP "_COPY" ; LOOKUP "10.11.78.56"; LOOKUP COMPOUND { PUTROOTFH, LOOKUP "_COPY" ; LOOKUP
"_FH" ; OPEN "0x12345" ; GETFH } "FvhH1OKbu8VrxvV1erdjvR7N" ; LOOKUP "10.11.78.56"; LOOKUP "_FH" ;
OPEN "0x12345" ; GETFH }
The source server will therefore know that these NFSv4.1 operations Provided that the random number is unpredictable and has been kept
are being issued by the destination server identified in the secret by the parties involved, the source server will therefore know
COPY_NOTIFY. that these NFSv4.x operations are being issued by the destination
server identified in the COPY_NOTIFY. This random number technique
only provides initial authentication of the destination server, and
cannot defend against man-in-the-middle attacks after authentication
or an eavesdropper that observes the random number on the wire.
Other secure communication techniques (e.g., IPsec) are necessary to
block these attacks.
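
A minimal Python sketch of how a source server might map the names in such a COMPOUND back to an outstanding COPY_NOTIFY is given below. The table layout, the function name, and the user ID "alice" are hypothetical; the draft does not prescribe any particular data structure.

```python
import hmac

# Hypothetical table the source server fills in when it answers COPY_NOTIFY:
# token -> (source filehandle, user ID, destination address named by the client).
pending_copies = {
    "FvhH1OKbu8VrxvV1erdjvR7N": ("0x12345", "alice", "10.11.78.56"),
}


def identify_destination(lookup_names):
    """Map the LOOKUP/OPEN names of an incoming COMPOUND to a pending copy.

    Returns the matching (fh, user, destination) tuple, or None when the
    names do not correspond to an outstanding COPY_NOTIFY.
    """
    if len(lookup_names) != 5:
        return None
    copy_dir, token, dest_addr, fh_dir, fh = lookup_names
    if copy_dir != "_COPY" or fh_dir != "_FH":
        return None
    for known_token, entry in pending_copies.items():
        # Constant-time comparison, so timing does not reveal token prefixes.
        if hmac.compare_digest(known_token.encode(), token.encode()):
            known_fh, _user, known_dest = entry
            if (known_fh, known_dest) == (fh, dest_addr):
                return entry
    return None
```

The lookup does not consult the connecting address, so a destination that connects from 192.168.33.56 is still matched to the entry recorded for 10.11.78.56.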
2.4.1.4. Inter-Server Copy without ONC RPC and RPCSEC_GSSv3 2.4.1.4. Inter-Server Copy without ONC RPC and RPCSEC_GSSv3
The same techniques as Section 2.4.1.3, using unique URLs for each The same techniques as Section 2.4.1.3, using unique URLs for each
destination server, can be used for other protocols (e.g., HTTP [14] destination server, can be used for other protocols (e.g., HTTP [14]
and FTP [15]) as well. and FTP [15]) as well.
3. Sparse Files 3. Sparse Files
3.1. Introduction 3.1. Introduction
skipping to change at page 25, line 18 skipping to change at page 24, line 34
the zeroes to be transferred. the zeroes to be transferred.
A sparse file is typically created by initializing the file to be all A sparse file is typically created by initializing the file to be all
zeros - nothing is written to the data in the file, instead the hole zeros - nothing is written to the data in the file, instead the hole
is recorded in the metadata for the file. So an 8G disk image might is recorded in the metadata for the file. So an 8G disk image might
be represented initially by a couple hundred bits in the inode and be represented initially by a couple hundred bits in the inode and
nothing on the disk. If the VM then writes 100M to a file in the nothing on the disk. If the VM then writes 100M to a file in the
middle of the image, there would now be two holes represented in the middle of the image, there would now be two holes represented in the
metadata and 100M in the data. metadata and 100M in the data.
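
The disk image example can be worked through with a short Python sketch that derives the holes from the list of written extents. The function name and the placement of the 100M write at the 4G mark are assumptions for illustration; a real file system tracks this information in its own metadata format.

```python
def find_holes(file_size, data_extents):
    """Return (offset, length) pairs for every gap not covered by data.

    data_extents is a list of (offset, length) pairs assumed to be
    non-overlapping; they are sorted here before the scan.
    """
    holes = []
    cursor = 0
    for offset, length in sorted(data_extents):
        if offset > cursor:
            holes.append((cursor, offset - cursor))
        cursor = max(cursor, offset + length)
    if cursor < file_size:
        holes.append((cursor, file_size - cursor))
    return holes


GIB = 1024 ** 3
MIB = 1024 ** 2

image_size = 8 * GIB
# Assumed placement: the 100M write starts at the 4G mark.
written = [(4 * GIB, 100 * MIB)]
holes = find_holes(image_size, written)
```

With these inputs the image holds one hole before the written region and one after it, matching the two holes described above.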
This section introduces a new operation READ_PLUS which supports all This section introduces a new operation READ_PLUS (Section 12.10)
the features of READ but includes an extension to support sparse which supports all the features of READ but includes an extension to
pattern files. READ_PLUS is guaranteed to perform no worse than support sparse pattern files. READ_PLUS is guaranteed to perform no
READ, and can dramatically improve performance with sparse files. worse than READ, and can dramatically improve performance with sparse
READ_PLUS does not depend on pNFS protocol features, but can be used files. READ_PLUS does not depend on pNFS protocol features, but can
by pNFS to support sparse files. be used by pNFS to support sparse files.
3.2. Terminology 3.2. Terminology
Regular file: An object of file type NF4REG or NF4NAMEDATTR. Regular file: An object of file type NF4REG or NF4NAMEDATTR.
Sparse file: A Regular file that contains one or more Holes. Sparse file: A Regular file that contains one or more Holes.
Hole: A byte range within a Sparse file that contains regions of all Hole: A byte range within a Sparse file that contains regions of all
zeroes. For block-based file systems, this could also be an zeroes. For block-based file systems, this could also be an
unallocated region of the file. unallocated region of the file.
Hole Threshold: The minimum length of a Hole as determined by the Hole Threshold: The minimum length of a Hole as determined by the
server. If a server chooses to define a Hole Threshold, then it server. If a server chooses to define a Hole Threshold, then it
would not return hole information (nfs_readplusreshole) with a would not return hole information about holes with a length
hole_offset and hole_length that specify a range shorter than the shorter than the Hole Threshold.
Hole Threshold.
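
The effect of a Hole Threshold can be sketched in Python as a simple filter. The function name is hypothetical, and holes are represented here as (offset, length) pairs purely for illustration.

```python
def reportable_holes(holes, hole_threshold):
    """Keep only the holes a server with this threshold would describe."""
    return [(offset, length) for offset, length in holes if length >= hole_threshold]
```

A hole whose length equals the threshold is still reported, since only holes shorter than the Hole Threshold are suppressed.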
3.3. Overview of Sparse Files and NFSv4
This section provides sparse file support to the largest number of
NFS client and server implementations, and as such proposes to add a
new return code to the READ_PLUS operation instead of proposing
additions or extensions of new or existing optional features (such as
pNFS).
3.4. Operation 65: READ_PLUS
This section introduces a new read operation, named READ_PLUS, which
allows NFS clients to avoid reading holes in a sparse file.
READ_PLUS is guaranteed to perform no worse than READ, and can
dramatically improve performance with sparse files.
READ_PLUS supports all the features of the existing NFSv4.1 READ
operation [2] and adds a simple yet significant extension to the
format of its response. The change allows the client to avoid
receiving all zeroes for a file hole, a transfer that wastes
computational and network resources and reduces performance.
READ_PLUS uses a new result structure that tells the client that the
result is all zeroes and gives the byte-range of the hole in which
the request was made. Returning the hole's byte-range, and only upon
request, avoids transferring large Data Region Maps that may be soon
invalidated and contain information about a file that may not even be
read in its entirety.
A new read operation is required due to NFSv4.1 minor versioning
rules that do not allow modification of an existing operation's
arguments or results. READ_PLUS is designed in such a way as to allow
future extensions to the result structure. The same approach could
be taken to extend the argument structure, but a good use case is
first required to make such a change.
3.4.1. ARGUMENT
struct READ_PLUS4args {
/* CURRENT_FH: file */
stateid4 rpa_stateid;
offset4 rpa_offset;
count4 rpa_count;
};
3.4.2. RESULT
union read_plus_content switch (data_content4 content) {
case NFS4_CONTENT_DATA:
opaque rpc_data<>;
case NFS4_CONTENT_APP_BLOCK:
app_data_block4 rpc_block;
case NFS4_CONTENT_HOLE:
data_info4 rpc_hole;
default:
void;
};
/*
* Allow a return of an array of contents.
*/
struct read_plus_res4 {
bool rpr_eof;
read_plus_content rpr_contents<>;
};
union READ_PLUS4res switch (nfsstat4 status) {
case NFS4_OK:
read_plus_res4 resok4;
default:
void;
};
3.4.3. DESCRIPTION
The READ_PLUS operation is based upon the NFSv4.1 READ operation [2],
and similarly reads data from the regular file identified by the
current filehandle.
The client provides an offset of where the READ_PLUS is to start and
a count of how many bytes are to be read. An offset of zero means to
read data starting at the beginning of the file. If offset is
greater than or equal to the size of the file, the status NFS4_OK is
returned with nfs_readplusrestype4 set to READ_OK, data length set to
zero, and eof set to TRUE. The READ_PLUS is subject to access
permissions checking.
If the client specifies a count value of zero, the READ_PLUS succeeds
and returns zero bytes of data, again subject to access permissions
checking. In all situations, the server may choose to return fewer
bytes than specified by the client. The client needs to check for
this condition and handle the condition appropriately.
If the client specifies an offset and count value that is entirely
contained within a hole of the file, the status NFS4_OK is returned
with nfs_readplusresok4 set to READ_HOLE, and if information is
available regarding the hole, a nfs_readplusreshole structure
containing the offset and range of the entire hole. The
nfs_readplusreshole structure is considered valid until the file is
changed (detected via the change attribute). The server MUST provide
the same semantics for nfs_readplusreshole as if the client read the
region and received zeroes; the implied hole's contents lifetime MUST
be exactly the same as any other read data.
If the client specifies an offset and count value that begins in a
non-hole of the file but extends into a hole, the server should return
a short read with status NFS4_OK, nfs_readplusresok4 set to READ_OK,
and data length set to the number of bytes returned. The client will
then issue another READ_PLUS for the remaining bytes, to which the
server will respond with information about the hole in the file.
If the server knows that the requested byte range lies within a hole
of the file, but has no further information regarding the hole, it
returns a nfs_readplusreshole structure with holeres4 set to
HOLE_NOINFO.
If hole information is available and can be returned to the client,
the server returns a nfs_readplusreshole structure with the value of
holeres4 set to HOLE_INFO. The values of hole_offset and hole_length
define the byte-range for the current hole in the file. These values
represent the information known to the server and may describe a
byte-range smaller than the true size of the hole.
Except when special stateids are used, the stateid value for a
READ_PLUS request represents a value returned from a previous byte-
range lock or share reservation request or the stateid associated
with a delegation. The stateid identifies the associated owners if
any and is used by the server to verify that the associated locks are
still valid (e.g., have not been revoked).
If the read ended at the end-of-file (formally, in a correctly formed
READ_PLUS operation, if offset + count is equal to the size of the
file), or the READ_PLUS operation extends beyond the size of the file
(if offset + count is greater than the size of the file), eof is
returned as TRUE; otherwise, it is FALSE. A successful READ_PLUS of
an empty file will always return eof as TRUE.
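The eof rule above can be sketched as a one-line predicate. This is
a minimal illustration, not part of the protocol definition: the
function name read_plus_eof is hypothetical, and the sketch assumes
the server services the full requested range (it ignores short
reads).

```python
def read_plus_eof(offset: int, count: int, file_size: int) -> bool:
    """Return the eof flag for a correctly formed READ_PLUS request.

    Hypothetical helper: eof is TRUE when the requested range ends
    at, or extends beyond, the end of the file.
    """
    return offset + count >= file_size
```

Under this predicate a read of an empty file (file_size of zero)
always yields TRUE, as does any request whose offset is at or past
the end of the file.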
If the current filehandle is not an ordinary file, an error will be
returned to the client. In the case that the current filehandle
represents an object of type NF4DIR, NFS4ERR_ISDIR is returned. If
the current filehandle designates a symbolic link, NFS4ERR_SYMLINK is
returned. In all other cases, NFS4ERR_WRONG_TYPE is returned.
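The file type check described above might be sketched as follows.
The function name is hypothetical; NF4REG and NF4LNK are taken from
the NFSv4 file type enumeration rather than from this section, and
the types are modeled as plain strings for illustration.

```python
def read_plus_type_status(file_type: str) -> str:
    """Map the type of the current filehandle to a READ_PLUS status.

    Hypothetical helper; file types are modeled as strings.
    """
    if file_type == "NF4REG":
        return "NFS4_OK"             # ordinary file: the read may proceed
    if file_type == "NF4DIR":
        return "NFS4ERR_ISDIR"       # directories cannot be read this way
    if file_type == "NF4LNK":
        return "NFS4ERR_SYMLINK"     # symbolic links cannot be read this way
    return "NFS4ERR_WRONG_TYPE"      # every other object type
```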
For a READ_PLUS with a stateid value of all bits equal to zero, the
server MAY allow the READ_PLUS to be serviced subject to mandatory
byte-range locks or the current share deny modes for the file. For a
READ_PLUS with a stateid value of all bits equal to one, the server
MAY allow READ_PLUS operations to bypass locking checks at the
server.
On success, the current filehandle retains its value.
3.4.4. IMPLEMENTATION
If the server returns a "short read" (i.e., less data than requested
and eof is set to FALSE), the client should send another READ_PLUS to
get the remaining data. A server may return less data than requested
under several circumstances. The file may have been truncated by
another client or perhaps on the server itself, changing the file
size from what the requesting client believes to be the case. This
would reduce the actual amount of data available to the client. It
is possible that the server reduces the transfer size and so returns a
short read result. Server resource exhaustion may also result in a
short read.
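A client-side loop for handling short reads could look like the
sketch below. The names read_fully and make_server are hypothetical,
the server is modeled as a callable returning (data, eof), and hole
results are deliberately left out to keep the retry logic visible.

```python
def make_server(content: bytes, max_transfer: int):
    """Build a toy READ_PLUS handler that never returns more than
    max_transfer bytes per call (an assumed server-side limit)."""
    def read_plus(offset: int, count: int):
        data = content[offset:offset + min(count, max_transfer)]
        eof = offset + len(data) >= len(content)
        return data, eof
    return read_plus


def read_fully(read_plus, offset: int, count: int) -> bytes:
    """Reissue READ_PLUS until count bytes arrive or eof is seen."""
    chunks = []
    remaining = count
    while remaining > 0:
        data, eof = read_plus(offset, remaining)
        chunks.append(data)
        offset += len(data)
        remaining -= len(data)
        if eof or not data:
            # Stop at end-of-file; also stop on an empty non-eof
            # reply so the sketch cannot spin forever.
            break
    return b"".join(chunks)
```

For example, with a server limited to 3 bytes per reply, a request
for 6 bytes is satisfied by two READ_PLUS calls.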
If mandatory byte-range locking is in effect for the file, and if the
byte-range corresponding to the data to be read from the file is
WRITE_LT locked by an owner not associated with the stateid, the
server will return the NFS4ERR_LOCKED error. The client should try
to get the appropriate READ_LT via the LOCK operation before re-
attempting the READ_PLUS. When the READ_PLUS completes, the client
should release the byte-range lock via LOCKU. In addition, the
server MUST return a nfs_readplusreshole structure with values of
hole_offset and hole_length that are within the owner's locked byte
range.
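The restriction on hole information under mandatory locking amounts
to intersecting two byte ranges. A minimal sketch, assuming both the
hole and the lock are given as (offset, length) pairs; the function
name clip_hole_to_lock is hypothetical.

```python
def clip_hole_to_lock(hole_offset, hole_length, lock_offset, lock_length):
    """Return the part of the hole inside the owner's locked byte
    range as (offset, length), or None when they do not overlap."""
    start = max(hole_offset, lock_offset)
    end = min(hole_offset + hole_length, lock_offset + lock_length)
    if start >= end:
        return None
    return (start, end - start)
```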
If another client has an OPEN_DELEGATE_WRITE delegation for the file
being read, the delegation must be recalled, and the operation cannot
proceed until that delegation is returned or revoked. Except where
this happens very quickly, one or more NFS4ERR_DELAY errors will be
returned to requests made while the delegation remains outstanding.
Normally, delegations will not be recalled as a result of a READ_PLUS
operation since the recall will occur as a result of an earlier OPEN.
However, since it is possible for a READ_PLUS to be done with a
special stateid, the server needs to check for this case even though
the client should have done an OPEN previously.
3.4.4.1. Additional pNFS Implementation Information
With pNFS, the semantics of using READ_PLUS remains the same. Any
data server MAY return a READ_HOLE result for a READ_PLUS request
that it receives.
When a data server chooses to return a READ_HOLE result, it has the
option of returning hole information for the data stored on that data
server (as defined by the data layout), but it MUST NOT return a
nfs_readplusreshole structure with a byte range that includes data
managed by another data server.
1. Data servers that cannot determine hole information SHOULD return
HOLE_NOINFO.
2. Data servers that can obtain hole information for the parts of
the file stored on that data server SHOULD return HOLE_INFO and
the byte range of the hole stored on that data server.
A data server should do its best to return as much information about
a hole as is feasible without having to contact the metadata server.
If communication with the metadata server is required, then every
attempt should be taken to minimize the number of requests.
If mandatory locking is enforced, then the data server must also
ensure that it returns only information for a hole that is within the
owner's locked byte range.
3.4.5. READ_PLUS with Sparse Files Example
To see how the return value READ_HOLE will work, the following table
describes a sparse file. For each byte range, the file contains
either non-zero data or a hole. In addition, the server in this
example uses a hole threshold of 32K.
+-------------+----------+
| Byte-Range | Contents |
+-------------+----------+
| 0-15999 | Hole |
| 16K-31999 | Non-Zero |
| 32K-255999 | Hole |
| 256K-287999 | Non-Zero |
| 288K-353999 | Hole |
| 354K-417999 | Non-Zero |
+-------------+----------+
Table 1
Under the given circumstances, if a client were to read the file from
beginning to end with a max read size of 64K, the following would be
the result. This assumes the client has already opened the file and
acquired a valid stateid and just needs to issue READ_PLUS requests.
1. READ_PLUS(s, 0, 64K) --> NFS4_OK, readplusrestype4 = READ_OK, eof
= false, data<>[32K]. Return a short read, as the last half of
the request was all zeroes. Note that the first hole is read
back as all zeros as it is below the hole threshold.
2. READ_PLUS(s, 32K, 64K) --> NFS4_OK, readplusrestype4 = READ_HOLE,
nfs_readplusreshole(HOLE_INFO)(32K, 224K). The requested range
was all zeros, and the current hole begins at offset 32K and is
224K in length.
3. READ_PLUS(s, 256K, 64K) --> NFS4_OK, readplusrestype4 = READ_OK,
eof = false, data<>[32K]. Return a short read, as the last half
of the request was all zeroes.
4. READ_PLUS(s, 288K, 64K) --> NFS4_OK, readplusrestype4 = READ_HOLE,
nfs_readplusreshole(HOLE_INFO)(288K, 66K).
5. READ_PLUS(s, 354K, 64K) --> NFS4_OK, readplusrestype4 = READ_OK,
eof = true, data<>[64K].
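The five exchanges above can be reproduced with a small simulation.
This is a sketch under stated assumptions, not a server
implementation: K is taken to be 1000 (suggested by boundaries such
as 15999 in Table 1), a hole is reported only when its length is at
least the threshold, a READ_HOLE is returned whenever the offset
falls inside such a hole, and all names (read_plus,
read_whole_file, EXTENTS) are hypothetical.

```python
K = 1000                  # assumption: Table 1 endpoints imply K = 1000
HOLE_THRESHOLD = 32 * K   # hole threshold used by the example server

# (start, end_exclusive, is_hole) extents transcribed from Table 1
EXTENTS = [
    (0, 16 * K, True),
    (16 * K, 32 * K, False),
    (32 * K, 256 * K, True),
    (256 * K, 288 * K, False),
    (288 * K, 354 * K, True),
    (354 * K, 418 * K, False),
]
FILE_SIZE = EXTENTS[-1][1]


def is_reportable_hole(extent):
    """A hole is reported only if it reaches the hole threshold."""
    start, end, is_hole = extent
    return is_hole and (end - start) >= HOLE_THRESHOLD


def read_plus(offset, count):
    """Return ("READ_OK", length, eof) or
    ("READ_HOLE", hole_offset, hole_length)."""
    if offset >= FILE_SIZE:
        return ("READ_OK", 0, True)
    for ext in EXTENTS:
        start, end, _ = ext
        if start <= offset < end and is_reportable_hole(ext):
            return ("READ_HOLE", start, end - start)
    # Data (or a small hole read back as zeroes): stop the read at
    # the next reportable hole, producing a short read.
    stop = min(offset + count, FILE_SIZE)
    for ext in EXTENTS:
        start = ext[0]
        if offset < start < stop and is_reportable_hole(ext):
            stop = start
            break
    return ("READ_OK", stop - offset, stop >= FILE_SIZE)


def read_whole_file(max_read=64 * K):
    """Client loop: skip past reported holes, advance past data."""
    results, offset = [], 0
    while True:
        res = read_plus(offset, max_read)
        results.append(res)
        if res[0] == "READ_HOLE":
            offset = res[1] + res[2]
        else:
            offset += res[1]
            if res[2]:
                return results
```

Calling read_whole_file() issues five requests whose results match
steps 1 through 5, with the client jumping from offset 32K to 256K
and from 288K to 354K on the strength of the hole information.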
3.5. Related Work 3.3. Determining the next hole/data
Solaris and ZFS support an extension to lseek(2) that allows Solaris and ZFS support an extension to lseek(2) that allows
applications to discover holes in a file. The values, SEEK_HOLE and applications to discover holes in a file. The values, SEEK_HOLE and
SEEK_DATA, allow clients to seek to the next hole or beginning of SEEK_DATA, allow clients to seek to the next hole or beginning of
data, respectively. data, respectively.
XFS supports the XFS_IOC_GETBMAP extended attribute, which returns
the Data Region Map for a file. Clients can then use this
information to avoid reading holes in a file.
NTFS and CIFS support the FSCTL_SET_SPARSE attribute, which allows
applications to control whether empty regions of the file are
preallocated and filled in with zeros or simply left unallocated.
3.6. Other Proposed Designs
3.6.1. Multi-Data Server Hole Information
The current design prohibits pNFS data servers from returning hole
information for regions of a file that are not stored on that data
server. Having data servers return information regarding other data
servers changes the fundamental principle that all metadata
information comes from the metadata server.
Here is a brief description of how it would work if we did choose to
support multi-data server hole information:
For a data server that can obtain hole information for the entire
file without severe performance impact, it MAY return HOLE_INFO and
the byte range of the entire file hole. When a pNFS client receives
a READ_HOLE result and a non-empty nfs_readplusreshole structure, it
MAY use this information in conjunction with a valid layout for the
file to determine the next data server for the next region of data
that is not in a hole.
3.6.2. Data Result Array
If a single read request contains one or more Holes with a length
greater than the Sparse Threshold, the current design would return
results indicating a short read to the client. A client would then
send a series of read requests to the server to retrieve information
for the Holes and the remaining data. To avoid turning a single read
request into several exchanges between the client and server, the
server may need to choose a relatively large Sparse Threshold in
order to decrease the number of short reads it creates. A large
Sparse Threshold may miss many smaller holes, which in turn may
negate the benefits of sparse read support.
To avoid this situation, one option is to have the READ_PLUS
operation return information for multiple holes in a single return
value. This would allow several small holes to be described in a
single read response without requiring multiple exchanges between
the client and server.
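One way to picture such a multi-segment reply is sketched below.
The function name describe_range and the ("data" | "hole", offset,
length) tuple shape are hypothetical; the sketch assumes the server
knows the file as a sorted, contiguous list of extents and reports
only holes whose full length reaches the threshold, folding smaller
holes into the neighbouring data.

```python
def describe_range(extents, offset, count, threshold):
    """Describe [offset, offset + count) as a list of segments.

    extents: sorted, contiguous (start, end_exclusive, is_hole) tuples.
    Returns ("data" | "hole", segment_offset, segment_length) tuples.
    """
    segments = []
    end = offset + count
    for start, stop, is_hole in extents:
        lo, hi = max(start, offset), min(stop, end)
        if lo >= hi:
            continue  # extent lies outside the requested range
        big_hole = is_hole and (stop - start) >= threshold
        kind = "hole" if big_hole else "data"
        if segments and segments[-1][0] == kind == "data":
            # Merge small holes and adjacent data into one segment.
            prev = segments.pop()
            segments.append(("data", prev[1], prev[2] + hi - lo))
        else:
            segments.append((kind, lo, hi - lo))
    return segments
```

With this shape, a single request covering a whole sparse file
yields one reply listing every data region and every reportable
hole, instead of one exchange per region.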
One important item to consider with returning an array of data chunks
is its impact on RDMA, which may use different block sizes on the
client and server (among other things).
3.6.3. User-Defined Sparse Mask
Add a user-defined mask (instead of just zeroes). Should the mask be
specified by the server or by the client?
3.6.4. Allocated flag
A Hole on the server may be an allocated byte-range consisting of all
zeroes or may not be allocated at all. To ensure this information is
properly communicated to the client, it may be beneficial to add an
'alloc' flag to the HOLE_INFO section of nfs_readplusreshole. This
would allow an NFS client to copy a file from one file system to
another and have it more closely resemble the original.
3.6.5. Dense and Sparse pNFS File Layouts
The hole information returned from a data server must be understood
by pNFS clients using either Dense or Sparse file layout types. Does
the current READ_PLUS return value work for both layout types? Does
the data server know if it is using dense or sparse so that it can
return the correct hole_offset and hole_length values?
4. Space Reservation 4. Space Reservation
4.1. Introduction 4.1. Introduction
This section describes a set of operations that allow applications This section describes a set of operations that allow applications
such as hypervisors to reserve space for a file, report the amount of such as hypervisors to reserve space for a file, report the amount of
actual disk space a file occupies and freeup the backing space of a actual disk space a file occupies and freeup the backing space of a
file when it is not required. In virtualized environments, virtual file when it is not required. In virtualized environments, virtual
disk files are often stored on NFS mounted volumes. Since virtual disk files are often stored on NFS mounted volumes. Since virtual
disk files represent the hard disks of virtual machines, hypervisors disk files represent the hard disks of virtual machines, hypervisors
skipping to change at page 35, line 17 skipping to change at page 27, line 19
to the given file that would be freed on its deletion. In the to the given file that would be freed on its deletion. In the
example, both A and B would report space_freed as 4 * BLOCK_SIZE and example, both A and B would report space_freed as 4 * BLOCK_SIZE and
space_used as 10 * BLOCK_SIZE. If A is deleted, B will report space_used as 10 * BLOCK_SIZE. If A is deleted, B will report
space_freed as 10 * BLOCK_SIZE as the deletion of B would result in space_freed as 10 * BLOCK_SIZE as the deletion of B would result in
the deallocation of all 10 blocks. the deallocation of all 10 blocks.
The addition of this attribute doesn't solve the problem of space being The addition of this attribute doesn't solve the problem of space being
over-reported. However, over-reporting is better than under- over-reported. However, over-reporting is better than under-
reporting. reporting.
4.2. Operations and attributes
In the sections that follow, one operation and three attributes are
defined that together provide the space management facilities
outlined earlier in the document. The operation is intended to be
OPTIONAL and the attributes RECOMMENDED as defined in section 17 of
[2].
4.3. Attribute 77: space_reserved
The space_reserved attribute is a read/write attribute of type
boolean. It is a per file attribute. When the space_reserved
attribute is set via SETATTR, the server must ensure that there is
disk space to accommodate every byte in the file before it can return
success. If the server cannot guarantee this, it must return
NFS4ERR_NOSPC.
If the client tries to grow a file which has the space_reserved
attribute set, the server must guarantee that there is disk space to
accommodate every byte in the file with the new size before it can
return success. If the server cannot guarantee this, it must return
NFS4ERR_NOSPC.
It is not required that the server allocate the space to the file
before returning success. The allocation can be deferred; however,
it must be guaranteed that it will not fail for lack of space.
The value of space_reserved can be obtained at any time through
GETATTR.
In order to avoid ambiguity, the space_reserved bit cannot be set
along with the size bit in SETATTR. Increasing the size of a file
with space_reserved set will fail if space reservation cannot be
guaranteed for the new size. If the file size is decreased, space
reservation is only guaranteed for the new size and the extra blocks
backing the file can be released.
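The reservation rules above can be illustrated with a toy model.
This is a sketch only: ToyVolume, NoSpace, and the method names are
hypothetical, NoSpace stands in for the NFS4ERR_NOSPC status, and
files are assumed to be fully sparse (no blocks allocated), so the
space to reserve equals the file size.

```python
class NoSpace(Exception):
    """Stands in for the NFS4ERR_NOSPC status."""


class ToyVolume:
    def __init__(self, free_bytes):
        self.free_bytes = free_bytes   # space not yet promised to any file
        self.files = {}                # name -> {"size": int, "reserved": bool}

    def create(self, name, size):
        self.files[name] = {"size": size, "reserved": False}

    def set_space_reserved(self, name):
        """SETATTR of space_reserved: succeed only if every byte fits."""
        f = self.files[name]
        if f["reserved"]:
            return
        if f["size"] > self.free_bytes:
            raise NoSpace(name)
        self.free_bytes -= f["size"]
        f["reserved"] = True

    def set_size(self, name, new_size):
        """Grow or shrink a file, keeping the reservation in step."""
        f = self.files[name]
        if f["reserved"]:
            delta = new_size - f["size"]
            if delta > self.free_bytes:
                raise NoSpace(name)    # growth cannot be guaranteed
            self.free_bytes -= delta   # a negative delta releases space
        f["size"] = new_size
```

Note that the model only does bookkeeping; as the text says, actual
allocation can be deferred as long as it cannot later fail for lack
of space.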
4.4. Attribute 78: space_freed
space_freed gives the number of bytes freed if the file is deleted.
This attribute is read only and is of type length4. It is a per file
attribute.
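For files that share blocks, space_freed can be thought of as the
size of the blocks referenced by no other file. The sketch below
replays the earlier A and B example; the layout of six shared blocks
is inferred from the reported numbers (10 blocks used, 4 freed), and
BLOCK_SIZE, space_used, and space_freed as Python functions are
hypothetical.

```python
BLOCK_SIZE = 4096  # assumption: any fixed block size works here


def space_used(name, files):
    """Bytes backing the file, counting shared blocks in full."""
    return len(files[name]) * BLOCK_SIZE


def space_freed(name, files):
    """Bytes that would be freed if the named file were deleted."""
    others = set().union(*(b for n, b in files.items() if n != name))
    return len(files[name] - others) * BLOCK_SIZE


# Block numbers per file: blocks 0-5 are shared by A and B.
FILES = {
    "A": set(range(0, 10)),
    "B": set(range(0, 6)) | set(range(10, 14)),
}
```

With both files present, each reports 4 * BLOCK_SIZE freed and
10 * BLOCK_SIZE used; once A is removed from FILES, B reports
10 * BLOCK_SIZE freed.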
5. Support for Application IO Hints 5. Support for Application IO Hints
5.1. Introduction 5.1. Introduction
Applications currently have several options for communicating I/O Applications currently have several options for communicating I/O
access patterns to the NFS client. While this can help the NFS access patterns to the NFS client. While this can help the NFS
client optimize I/O and caching for a file, it does not allow the NFS client optimize I/O and caching for a file, it does not allow the NFS
server and its exported file system to do likewise. Therefore, here server and its exported file system to do likewise. Therefore, here
we put forth a proposal for the NFSv4.2 protocol to allow we put forth a proposal for the NFSv4.2 protocol to allow
applications to communicate their expected behavior to the server. applications to communicate their expected behavior to the server.
skipping to change at page 55, line 34 skipping to change at page 46, line 34
Finally, if the client sees NFS4_CHANGE_TYPE_IS_VERSION_COUNTER, it Finally, if the client sees NFS4_CHANGE_TYPE_IS_VERSION_COUNTER, it
has the ability to predict what the resulting change attribute value has the ability to predict what the resulting change attribute value
should be after a COMPOUND containing a SETATTR, WRITE, or CREATE. should be after a COMPOUND containing a SETATTR, WRITE, or CREATE.
This again allows it to detect changes made in parallel by another This again allows it to detect changes made in parallel by another
client. The value NFS4_CHANGE_TYPE_IS_VERSION_COUNTER_NOPNFS permits client. The value NFS4_CHANGE_TYPE_IS_VERSION_COUNTER_NOPNFS permits
the same, but only if the client is not doing pNFS WRITEs. the same, but only if the client is not doing pNFS WRITEs.
9. Security Considerations 9. Security Considerations
10. Operations: REQUIRED, RECOMMENDED, or OPTIONAL 10. File Attributes
10.1. Attribute Definitions
10.1.1. Attribute 77: space_reserved
The space_reserved attribute is a read/write attribute of type
boolean. It is a per file attribute. When the space_reserved
attribute is set via SETATTR, the server must ensure that there is
disk space to accommodate every byte in the file before it can return
success. If the server cannot guarantee this, it must return
NFS4ERR_NOSPC.
If the client tries to grow a file which has the space_reserved
attribute set, the server must guarantee that there is disk space to
accommodate every byte in the file with the new size before it can
return success. If the server cannot guarantee this, it must return
NFS4ERR_NOSPC.
It is not required that the server allocate the space to the file
before returning success. The allocation can be deferred; however,
it must be guaranteed that it will not fail for lack of space.
The value of space_reserved can be obtained at any time through
GETATTR.
In order to avoid ambiguity, the space_reserved bit cannot be set
along with the size bit in SETATTR. Increasing the size of a file
with space_reserved set will fail if space reservation cannot be
guaranteed for the new size. If the file size is decreased, space
reservation is only guaranteed for the new size and the extra blocks
backing the file can be released.
10.1.2. Attribute 78: space_freed
space_freed gives the number of bytes freed if the file is deleted.
This attribute is read only and is of type length4. It is a per file
attribute.
11. Operations: REQUIRED, RECOMMENDED, or OPTIONAL
The following tables summarize the operations of the NFSv4.2 protocol The following tables summarize the operations of the NFSv4.2 protocol
and the corresponding designation of REQUIRED, RECOMMENDED, and and the corresponding designation of REQUIRED, RECOMMENDED, and
OPTIONAL to implement or MUST NOT implement. The designation of MUST OPTIONAL to implement or MUST NOT implement. The designation of MUST
NOT implement is reserved for those operations that were defined in NOT implement is reserved for those operations that were defined in
either NFSv4.0 or NFSV4.1 and MUST NOT be implemented in NFSv4.2. either NFSv4.0 or NFSV4.1 and MUST NOT be implemented in NFSv4.2.
For the most part, the REQUIRED, RECOMMENDED, or OPTIONAL designation For the most part, the REQUIRED, RECOMMENDED, or OPTIONAL designation
for operations sent by the client is for the server implementation. for operations sent by the client is for the server implementation.
The client is generally required to implement the operations needed The client is generally required to implement the operations needed
skipping to change at page 59, line 8 skipping to change at page 50, line 45
| CB_RECALL_ANY | OPT | FDELG, DDELG, pNFS | | CB_RECALL_ANY | OPT | FDELG, DDELG, pNFS |
| | | (REQ) | | | | (REQ) |
| CB_RECALL_SLOT | REQ | | | CB_RECALL_SLOT | REQ | |
| CB_RECALLABLE_OBJ_AVAIL | OPT | DDELG, pNFS (REQ) | | CB_RECALLABLE_OBJ_AVAIL | OPT | DDELG, pNFS (REQ) |
| CB_SEQUENCE | OPT | FDELG, DDELG, pNFS | | CB_SEQUENCE | OPT | FDELG, DDELG, pNFS |
| | | (REQ) | | | | (REQ) |
| CB_WANTS_CANCELLED | OPT | FDELG, DDELG, pNFS | | CB_WANTS_CANCELLED | OPT | FDELG, DDELG, pNFS |
| | | (REQ) | | | | (REQ) |
+-------------------------+-------------------+---------------------+ +-------------------------+-------------------+---------------------+
11. NFSv4.2 Operations 12. NFSv4.2 Operations
11.1. Operation 59: COPY - Initiate a server-side copy
11.1.1. ARGUMENT 12.1. Operation 59: COPY - Initiate a server-side copy
12.1.1. ARGUMENT
const COPY4_GUARDED = 0x00000001; const COPY4_GUARDED = 0x00000001;
const COPY4_METADATA = 0x00000002; const COPY4_METADATA = 0x00000002;
struct COPY4args { struct COPY4args {
/* SAVED_FH: source file */ /* SAVED_FH: source file */
/* CURRENT_FH: destination file or */ /* CURRENT_FH: destination file or */
/* directory */ /* directory */
offset4 ca_src_offset; offset4 ca_src_offset;
offset4 ca_dst_offset; offset4 ca_dst_offset;
length4 ca_count; length4 ca_count;
uint32_t ca_flags; uint32_t ca_flags;
component4 ca_destination; component4 ca_destination;
netloc4 ca_source_server<>; netloc4 ca_source_server<>;
}; };
11.1.2. RESULT 12.1.2. RESULT
union COPY4res switch (nfsstat4 cr_status) { union COPY4res switch (nfsstat4 cr_status) {
case NFS4_OK: case NFS4_OK:
stateid4 cr_callback_id<1>; stateid4 cr_callback_id<1>;
default: default:
length4 cr_bytes_copied; length4 cr_bytes_copied;
}; };
11.1.3. DESCRIPTION 12.1.3. DESCRIPTION
The COPY operation is used for both intra-server and inter-server The COPY operation is used for both intra-server and inter-server
copies. In both cases, the COPY is always sent from the client to copies. In both cases, the COPY is always sent from the client to
the destination server of the file copy. The COPY operation requests the destination server of the file copy. The COPY operation requests
that a file be copied from the location specified by the SAVED_FH that a file be copied from the location specified by the SAVED_FH
value to the location specified by the combination of CURRENT_FH and value to the location specified by the combination of CURRENT_FH and
ca_destination. ca_destination.
The SAVED_FH must be a regular file. If SAVED_FH is not a regular The SAVED_FH must be a regular file. If SAVED_FH is not a regular
file, the operation MUST fail and return NFS4ERR_WRONG_TYPE. file, the operation MUST fail and return NFS4ERR_WRONG_TYPE.
skipping to change at page 62, line 14 skipping to change at page 54, line 5
server, the behavior is implementation dependent. server, the behavior is implementation dependent.
If the metadata flag is set and the client is requesting a whole file If the metadata flag is set and the client is requesting a whole file
copy (i.e., ca_count is 0 (zero)), a subset of the destination file's copy (i.e., ca_count is 0 (zero)), a subset of the destination file's
attributes MUST be the same as the source file's corresponding attributes MUST be the same as the source file's corresponding
attributes and a subset of the destination file's attributes SHOULD attributes and a subset of the destination file's attributes SHOULD
be the same as the source file's corresponding attributes. The be the same as the source file's corresponding attributes. The
attributes in the MUST and SHOULD copy subsets will be defined for attributes in the MUST and SHOULD copy subsets will be defined for
each NFS version. each NFS version.
For NFSv4.1, Table 2 and Table 3 list the REQUIRED and RECOMMENDED For NFSv4.1, Table 1 and Table 2 list the REQUIRED and RECOMMENDED
attributes respectively. A "MUST" in the "Copy to destination file?" attributes respectively. A "MUST" in the "Copy to destination file?"
column indicates that the attribute is part of the MUST copy set. A column indicates that the attribute is part of the MUST copy set. A
"SHOULD" in the "Copy to destination file?" column indicates that the "SHOULD" in the "Copy to destination file?" column indicates that the
attribute is part of the SHOULD copy set. attribute is part of the SHOULD copy set.
+--------------------+----+---------------------------+ +--------------------+----+---------------------------+
| Name | Id | Copy to destination file? | | Name | Id | Copy to destination file? |
+--------------------+----+---------------------------+ +--------------------+----+---------------------------+
| supported_attrs | 0 | no | | supported_attrs | 0 | no |
| type | 1 | MUST | | type | 1 | MUST |
skipping to change at page 62, line 39 skipping to change at page 54, line 30
| symlink_support | 6 | no | | symlink_support | 6 | no |
| named_attr | 7 | no | | named_attr | 7 | no |
| fsid | 8 | no | | fsid | 8 | no |
| unique_handles | 9 | no | | unique_handles | 9 | no |
| lease_time | 10 | no | | lease_time | 10 | no |
| rdattr_error | 11 | no | | rdattr_error | 11 | no |
| filehandle | 19 | no | | filehandle | 19 | no |
| suppattr_exclcreat | 75 | no | | suppattr_exclcreat | 75 | no |
+--------------------+----+---------------------------+ +--------------------+----+---------------------------+
Table 2 Table 1
+--------------------+----+---------------------------+ +--------------------+----+---------------------------+
| Name | Id | Copy to destination file? | | Name | Id | Copy to destination file? |
+--------------------+----+---------------------------+ +--------------------+----+---------------------------+
| acl | 12 | MUST | | acl | 12 | MUST |
| aclsupport | 13 | no | | aclsupport | 13 | no |
| archive | 14 | no | | archive | 14 | no |
| cansettime | 15 | no | | cansettime | 15 | no |
| case_insensitive | 16 | no | | case_insensitive | 16 | no |
| case_preserving | 17 | no | | case_preserving | 17 | no |
skipping to change at page 64, line 14 skipping to change at page 56, line 4
| system | 46 | MUST | | system | 46 | MUST |
| time_access | 47 | MUST | | time_access | 47 | MUST |
| time_access_set | 48 | no | | time_access_set | 48 | no |
| time_backup | 49 | no | | time_backup | 49 | no |
| time_create | 50 | MUST | | time_create | 50 | MUST |
| time_delta | 51 | no | | time_delta | 51 | no |
| time_metadata | 52 | SHOULD | | time_metadata | 52 | SHOULD |
| time_modify | 53 | MUST | | time_modify | 53 | MUST |
| time_modify_set | 54 | no | | time_modify_set | 54 | no |
+--------------------+----+---------------------------+ +--------------------+----+---------------------------+
Table 2
Table 3
[NOTE: The source file's attribute values will take precedence over [NOTE: The source file's attribute values will take precedence over
any attribute values inherited by the destination file.] any attribute values inherited by the destination file.]
In the case of an inter-server copy or an intra-server copy between In the case of an inter-server copy or an intra-server copy between
file systems, the attributes supported for the source file and file systems, the attributes supported for the source file and
destination file could be different. By definition, the REQUIRED destination file could be different. By definition, the REQUIRED
attributes will be supported in all cases. If the metadata flag is attributes will be supported in all cases. If the metadata flag is
set and the source file has a RECOMMENDED attribute that is not set and the source file has a RECOMMENDED attribute that is not
supported for the destination file, the copy MUST fail with supported for the destination file, the copy MUST fail with
skipping to change at page 66, line 36 skipping to change at page 58, line 29
NFS4ERR_DELAY: The server does not have the resources to perform the NFS4ERR_DELAY: The server does not have the resources to perform the
copy operation at the current time. The client should retry the copy operation at the current time. The client should retry the
operation sometime in the future. operation sometime in the future.
NFS4ERR_METADATA_NOTSUPP: The destination file cannot support the NFS4ERR_METADATA_NOTSUPP: The destination file cannot support the
same metadata as the source file. same metadata as the source file.
NFS4ERR_WRONGSEC: The security mechanism being used by the client NFS4ERR_WRONGSEC: The security mechanism being used by the client
does not match the server's security policy. does not match the server's security policy.
11.2. Operation 60: COPY_ABORT - Cancel a server-side copy 12.2. Operation 60: COPY_ABORT - Cancel a server-side copy
11.2.1. ARGUMENT 12.2.1. ARGUMENT
struct COPY_ABORT4args { struct COPY_ABORT4args {
/* CURRENT_FH: destination file */ /* CURRENT_FH: destination file */
stateid4 caa_stateid; stateid4 caa_stateid;
}; };
11.2.2. RESULT 12.2.2. RESULT
struct COPY_ABORT4res { struct COPY_ABORT4res {
nfsstat4 car_status; nfsstat4 car_status;
}; };
11.2.3. DESCRIPTION 12.2.3. DESCRIPTION
COPY_ABORT is used for both intra- and inter-server asynchronous COPY_ABORT is used for both intra- and inter-server asynchronous
copies. The COPY_ABORT operation allows the client to cancel a copies. The COPY_ABORT operation allows the client to cancel a
server-side copy operation that it initiated. This operation is sent server-side copy operation that it initiated. This operation is sent
in a COMPOUND request from the client to the destination server. in a COMPOUND request from the client to the destination server.
This operation may be used to cancel a copy when the application that This operation may be used to cancel a copy when the application that
requested the copy exits before the operation is completed or for requested the copy exits before the operation is completed or for
some other reason. some other reason.
The request contains the filehandle and copy stateid cookies that act The request contains the filehandle and copy stateid cookies that act
skipping to change at page 67, line 42 skipping to change at page 59, line 33
NFS4ERR_RETRY: The abort failed, but a retry at some time in the NFS4ERR_RETRY: The abort failed, but a retry at some time in the
future MAY succeed. future MAY succeed.
NFS4ERR_COMPLETE_ALREADY: The abort failed, and a callback will NFS4ERR_COMPLETE_ALREADY: The abort failed, and a callback will
deliver the results of the copy operation. deliver the results of the copy operation.
NFS4ERR_SERVERFAULT: An error occurred on the server that does not NFS4ERR_SERVERFAULT: An error occurred on the server that does not
map to a specific error code. map to a specific error code.
11.3. Operation 61: COPY_NOTIFY - Notify a source server of a future 12.3. Operation 61: COPY_NOTIFY - Notify a source server of a future
copy copy
11.3.1. ARGUMENT 12.3.1. ARGUMENT
struct COPY_NOTIFY4args { struct COPY_NOTIFY4args {
/* CURRENT_FH: source file */ /* CURRENT_FH: source file */
netloc4 cna_destination_server; netloc4 cna_destination_server;
}; };
11.3.2. RESULT 12.3.2. RESULT
struct COPY_NOTIFY4resok { struct COPY_NOTIFY4resok {
nfstime4 cnr_lease_time; nfstime4 cnr_lease_time;
netloc4 cnr_source_server<>; netloc4 cnr_source_server<>;
}; };
union COPY_NOTIFY4res switch (nfsstat4 cnr_status) { union COPY_NOTIFY4res switch (nfsstat4 cnr_status) {
case NFS4_OK: case NFS4_OK:
COPY_NOTIFY4resok resok4; COPY_NOTIFY4resok resok4;
default: default:
void; void;
}; };
11.3.3. DESCRIPTION 12.3.3. DESCRIPTION
This operation is used for an inter-server copy. A client sends this This operation is used for an inter-server copy. A client sends this
operation in a COMPOUND request to the source server to authorize a operation in a COMPOUND request to the source server to authorize a
destination server identified by cna_destination_server to read the destination server identified by cna_destination_server to read the
file specified by CURRENT_FH on behalf of the given user. file specified by CURRENT_FH on behalf of the given user.
The cna_destination_server MUST be specified using the netloc4 The cna_destination_server MUST be specified using the netloc4
network location format. The server is not required to resolve the network location format. The server is not required to resolve the
cna_destination_server address before completing this operation. cna_destination_server address before completing this operation.
skipping to change at page 70, line 11 skipping to change at page 62, line 11
present on the source server. The client can determine the present on the source server. The client can determine the
correct location and reissue the operation with the correct correct location and reissue the operation with the correct
location. location.
NFS4ERR_NOTSUPP: The copy offload operation is not supported by the NFS4ERR_NOTSUPP: The copy offload operation is not supported by the
NFS server receiving this request. NFS server receiving this request.
NFS4ERR_WRONGSEC: The security mechanism being used by the client NFS4ERR_WRONGSEC: The security mechanism being used by the client
does not match the server's security policy. does not match the server's security policy.
11.4. Operation 62: COPY_REVOKE - Revoke a destination server's copy 12.4. Operation 62: COPY_REVOKE - Revoke a destination server's copy
privileges privileges
11.4.1. ARGUMENT 12.4.1. ARGUMENT
struct COPY_REVOKE4args { struct COPY_REVOKE4args {
/* CURRENT_FH: source file */ /* CURRENT_FH: source file */
netloc4 cra_destination_server; netloc4 cra_destination_server;
}; };
11.4.2. RESULT 12.4.2. RESULT
struct COPY_REVOKE4res { struct COPY_REVOKE4res {
nfsstat4 crr_status; nfsstat4 crr_status;
}; };
11.4.3. DESCRIPTION 12.4.3. DESCRIPTION
This operation is used for an inter-server copy. A client sends this This operation is used for an inter-server copy. A client sends this
operation in a COMPOUND request to the source server to revoke the operation in a COMPOUND request to the source server to revoke the
authorization of a destination server identified by authorization of a destination server identified by
cra_destination_server from reading the file specified by CURRENT_FH cra_destination_server from reading the file specified by CURRENT_FH
on behalf of the given user. If the cra_destination_server has already on behalf of the given user. If the cra_destination_server has already
begun copying the file, a successful return from this operation begun copying the file, a successful return from this operation
indicates that further access will be prevented. indicates that further access will be prevented.
The cra_destination_server MUST be specified using the netloc4 The cra_destination_server MUST be specified using the netloc4
skipping to change at page 71, line 16 skipping to change at page 63, line 16
a partial list): a partial list):
NFS4ERR_MOVED: The file system which contains the source file is not NFS4ERR_MOVED: The file system which contains the source file is not
present on the source server. The client can determine the present on the source server. The client can determine the
correct location and reissue the operation with the correct correct location and reissue the operation with the correct
location. location.
NFS4ERR_NOTSUPP: The copy offload operation is not supported by the NFS4ERR_NOTSUPP: The copy offload operation is not supported by the
NFS server receiving this request. NFS server receiving this request.
11.5. Operation 63: COPY_STATUS - Poll for status of a server-side copy 12.5. Operation 63: COPY_STATUS - Poll for status of a server-side copy
11.5.1. ARGUMENT 12.5.1. ARGUMENT
struct COPY_STATUS4args { struct COPY_STATUS4args {
/* CURRENT_FH: destination file */ /* CURRENT_FH: destination file */
stateid4 csa_stateid; stateid4 csa_stateid;
}; };
11.5.2. RESULT 12.5.2. RESULT
struct COPY_STATUS4resok { struct COPY_STATUS4resok {
length4 csr_bytes_copied; length4 csr_bytes_copied;
nfsstat4 csr_complete<1>; nfsstat4 csr_complete<1>;
}; };
union COPY_STATUS4res switch (nfsstat4 csr_status) { union COPY_STATUS4res switch (nfsstat4 csr_status) {
case NFS4_OK: case NFS4_OK:
COPY_STATUS4resok resok4; COPY_STATUS4resok resok4;
default: default:
void; void;
}; };
11.5.3. DESCRIPTION 12.5.3. DESCRIPTION
COPY_STATUS is used for both intra- and inter-server asynchronous COPY_STATUS is used for both intra- and inter-server asynchronous
copies. The COPY_STATUS operation allows the client to poll the copies. The COPY_STATUS operation allows the client to poll the
server to determine the status of an asynchronous copy operation. server to determine the status of an asynchronous copy operation.
This operation is sent by the client to the destination server. This operation is sent by the client to the destination server.
If this operation is successful, the number of bytes copied is If this operation is successful, the number of bytes copied is
returned to the client in the csr_bytes_copied field. The returned to the client in the csr_bytes_copied field. The
csr_bytes_copied value indicates the number of bytes copied but not csr_bytes_copied value indicates the number of bytes copied but not
which specific bytes have been copied. which specific bytes have been copied.
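As a rough illustration of how a client might interpret a COPY_STATUS4resok, the following Python sketch models the result as a plain data class. The class and function names are hypothetical. Treating a present csr_complete element as "the copy has finished with this status" is an assumption about the optional field, not something stated in the text shown here.

```python
from dataclasses import dataclass, field
from typing import List

NFS4_OK = 0  # nfsstat4 value for success


@dataclass
class CopyStatusResok:
    """Hypothetical in-memory form of COPY_STATUS4resok."""
    csr_bytes_copied: int
    # XDR csr_complete<1>: a list holding zero or one nfsstat4 values.
    csr_complete: List[int] = field(default_factory=list)


def describe_copy(res: CopyStatusResok) -> str:
    """Summarize progress; assumes a present csr_complete means 'finished'."""
    if len(res.csr_complete) > 1:
        raise ValueError("csr_complete<1> holds at most one element")
    if not res.csr_complete:
        return f"in progress: {res.csr_bytes_copied} bytes copied so far"
    if res.csr_complete[0] == NFS4_OK:
        return f"done: {res.csr_bytes_copied} bytes copied"
    return (f"failed with nfsstat4 {res.csr_complete[0]} "
            f"after {res.csr_bytes_copied} bytes")
```

Note that the byte count alone says nothing about which ranges are present on the destination, so a client built along these lines would not use it to resume a partial copy.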
skipping to change at page 72, line 29 skipping to change at page 64, line 29
NFS4ERR_NOTSUPP: The copy status operation is not supported by the NFS4ERR_NOTSUPP: The copy status operation is not supported by the
NFS server receiving this request. NFS server receiving this request.
NFS4ERR_BAD_STATEID: The stateid is not valid (see Section 2.3.2 NFS4ERR_BAD_STATEID: The stateid is not valid (see Section 2.3.2
below). below).
NFS4ERR_EXPIRED: The stateid has expired (see Copy Offload Stateid NFS4ERR_EXPIRED: The stateid has expired (see Copy Offload Stateid
section below). section below).
11.6. Modification to Operation 42: EXCHANGE_ID - Instantiate Client ID 12.6. Modification to Operation 42: EXCHANGE_ID - Instantiate Client ID
11.6.1. ARGUMENT 12.6.1. ARGUMENT
/* new */ /* new */
const EXCHGID4_FLAG_SUPP_FENCE_OPS = 0x00000004; const EXCHGID4_FLAG_SUPP_FENCE_OPS = 0x00000004;
11.6.2. RESULT 12.6.2. RESULT
Unchanged Unchanged
11.6.3. MOTIVATION 12.6.3. MOTIVATION
Enterprise applications require guarantees that an operation has Enterprise applications require guarantees that an operation has
either aborted or completed. NFSv4.1 provides this guarantee as long either aborted or completed. NFSv4.1 provides this guarantee as long
as the session is alive: simply send a SEQUENCE operation on the same as the session is alive: simply send a SEQUENCE operation on the same
slot with a new sequence number, and the successful return of slot with a new sequence number, and the successful return of
SEQUENCE indicates the previous operation has completed. However, if SEQUENCE indicates the previous operation has completed. However, if
the session is lost, there is no way to know when any in-progress the session is lost, there is no way to know when any in-progress
operations have aborted or completed. In hindsight, the NFSv4.1 operations have aborted or completed. In hindsight, the NFSv4.1
specification should have mandated that DESTROY_SESSION abort/ specification should have mandated that DESTROY_SESSION abort/
complete all outstanding operations. complete all outstanding operations.
11.6.4. DESCRIPTION 12.6.4. DESCRIPTION
A client SHOULD request the EXCHGID4_FLAG_SUPP_FENCE_OPS capability A client SHOULD request the EXCHGID4_FLAG_SUPP_FENCE_OPS capability
when it sends an EXCHANGE_ID operation. The server SHOULD set this when it sends an EXCHANGE_ID operation. The server SHOULD set this
capability in the EXCHANGE_ID reply whether the client requests it or capability in the EXCHANGE_ID reply whether the client requests it or
not. If the client ID is created with this capability then the not. If the client ID is created with this capability then the
following will occur: following will occur:
o The server will not reply to DESTROY_SESSION until all operations o The server will not reply to DESTROY_SESSION until all operations
in progress are completed or aborted. in progress are completed or aborted.
skipping to change at page 73, line 34 skipping to change at page 65, line 34
sessions, opens, locks, delegations, layouts, and/or wants are sessions, opens, locks, delegations, layouts, and/or wants are
deleted. deleted.
o The NFS server SHOULD support client ID trunking, and if it does o The NFS server SHOULD support client ID trunking, and if it does
and the EXCHGID4_FLAG_SUPP_FENCE_OPS capability is enabled, then a and the EXCHGID4_FLAG_SUPP_FENCE_OPS capability is enabled, then a
session ID created on one node of the storage cluster MUST be session ID created on one node of the storage cluster MUST be
destroyable via DESTROY_SESSION. In addition, DESTROY_CLIENTID destroyable via DESTROY_SESSION. In addition, DESTROY_CLIENTID
and an EXCHANGE_ID with a new verifier affect all sessions and an EXCHANGE_ID with a new verifier affect all sessions
regardless of what node the sessions were created on. regardless of what node the sessions were created on.
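The capability negotiation described above comes down to setting and testing one flag bit. The Python sketch below shows this under stated assumptions: the helper names are hypothetical, and the parameter names eia_flags and eir_flags follow the NFSv4.1 EXCHANGE_ID argument and result fields rather than anything defined in this section.

```python
EXCHGID4_FLAG_SUPP_FENCE_OPS = 0x00000004  # value from the ARGUMENT section


def request_fence_ops(eia_flags: int) -> int:
    """Return the client's EXCHANGE_ID flags with the fence capability added."""
    return eia_flags | EXCHGID4_FLAG_SUPP_FENCE_OPS


def fence_ops_enabled(eir_flags: int) -> bool:
    """True if the server's EXCHANGE_ID reply advertises the capability."""
    return bool(eir_flags & EXCHGID4_FLAG_SUPP_FENCE_OPS)
```

A client would only rely on the DESTROY_SESSION fencing behavior when fence_ops_enabled() is true for the reply it received.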
11.7. Operation 64: INITIALIZE 12.7. Operation 64: INITIALIZE
This operation can be used to initialize the structure imposed by an This operation can be used to initialize the structure imposed by an
application onto a file and to punch a hole into a file. application onto a file and to punch a hole into a file.
The server has no concept of the structure imposed by the The server has no concept of the structure imposed by the
application. Only when the application writes to a section of application. Only when the application writes to a section of
the file does order get imposed. In order to detect corruption even the file does order get imposed. In order to detect corruption even
before the application utilizes the file, the application will want before the application utilizes the file, the application will want
to initialize a range of ADBs. It uses the INITIALIZE operation to to initialize a range of ADBs. It uses the INITIALIZE operation to
do so. do so.
11.7.1. ARGUMENT 12.7.1. ARGUMENT
/* /*
* We use data_content4 in case we wish to * We use data_content4 in case we wish to
* extend new types later. Note that we * extend new types later. Note that we
* are explicitly disallowing data. * are explicitly disallowing data.
*/ */
union initialize_arg4 switch (data_content4 content) { union initialize_arg4 switch (data_content4 content) {
case NFS4_CONTENT_APP_BLOCK: case NFS4_CONTENT_APP_BLOCK:
app_data_block4 ia_adb; app_data_block4 ia_adb;
case NFS4_CONTENT_HOLE: case NFS4_CONTENT_HOLE:
skipping to change at page 74, line 28 skipping to change at page 66, line 28
void; void;
}; };
struct INITIALIZE4args { struct INITIALIZE4args {
/* CURRENT_FH: file */ /* CURRENT_FH: file */
stateid4 ia_stateid; stateid4 ia_stateid;
stable_how4 ia_stable; stable_how4 ia_stable;
initialize_arg4 ia_data<>; initialize_arg4 ia_data<>;
}; };
11.7.2. RESULT 12.7.2. RESULT
struct INITIALIZE4resok { struct INITIALIZE4resok {
count4 ir_count; count4 ir_count;
stable_how4 ir_committed; stable_how4 ir_committed;
verifier4 ir_writeverf; verifier4 ir_writeverf;
data_content4 ir_sparse; data_content4 ir_sparse;
}; };
union INITIALIZE4res switch (nfsstat4 status) { union INITIALIZE4res switch (nfsstat4 status) {
case NFS4_OK: case NFS4_OK:
INITIALIZE4resok resok4; INITIALIZE4resok resok4;
default: default:
void; void;
}; };
11.7.3. DESCRIPTION 12.7.3. DESCRIPTION
When the client invokes the INITIALIZE operation, it has two desired When the client invokes the INITIALIZE operation, it has two desired
results: results:
1. The structure described by the app_data_block4 be imposed on the 1. The structure described by the app_data_block4 be imposed on the
file. file.
2. The contents described by the app_data_block4 be sparse. 2. The contents described by the app_data_block4 be sparse.
If the server supports the INITIALIZE operation, it still might not If the server supports the INITIALIZE operation, it still might not
support sparse files. So if it receives the INITIALIZE operation, support sparse files. So if it receives the INITIALIZE operation,
then it MUST populate the contents of the file with the initialized then it MUST populate the contents of the file with the initialized
ADBs. In other words, if the server supports INITIALIZE, then it ADBs. In other words, if the server supports INITIALIZE, then it
supports the concept of ADBs. [[Comment.7: Do we want to support an supports the concept of ADBs. [[Comment.7: Do we want to support an
asynchronous INITIALIZE? Do we have to? --TH]] asynchronous INITIALIZE? Do we have to? --TH]] [[Comment.8: Need to
document union arm error code. --TH]]
If the data was already initialized, there are two interesting If the data was already initialized, there are two interesting
scenarios: scenarios:
1. The data blocks are allocated. 1. The data blocks are allocated.
2. Initializing in the middle of an existing ADB. 2. Initializing in the middle of an existing ADB.
If the data blocks were already allocated, then the INITIALIZE is a If the data blocks were already allocated, then the INITIALIZE is a
hole punch operation. If INITIALIZE supports sparse files, then the hole punch operation. If INITIALIZE supports sparse files, then the
data blocks are to be deallocated. If not, then the data blocks are data blocks are to be deallocated. If not, then the data blocks are
to be rewritten in the indicated ADB format. [[Comment.8: Need to to be rewritten in the indicated ADB format. [[Comment.9: Need to
document interaction between space reservation and hole punching? document interaction between space reservation and hole punching?
--TH]] --TH]]
Since the server has no knowledge of ADBs, it should not report Since the server has no knowledge of ADBs, it should not report
misaligned creation of ADBs. Even while it can detect them, it misaligned creation of ADBs. Even while it can detect them, it
cannot disallow them, as the application might be in the process of cannot disallow them, as the application might be in the process of
changing the size of the ADBs. Thus the server must be prepared to changing the size of the ADBs. Thus the server must be prepared to
handle an INITIALIZE into an existing ADB. handle an INITIALIZE into an existing ADB.
This document does not mandate the manner in which the server stores This document does not mandate the manner in which the server stores
ADBs sparsely for a file. It does assume that if ADBs are stored ADBs sparsely for a file. It does assume that if ADBs are stored
sparsely, then the server can detect when an INITIALIZE arrives that sparsely, then the server can detect when an INITIALIZE arrives that
will force a new ADB to start inside an existing ADB. For example, will force a new ADB to start inside an existing ADB. For example,
assume that ADBi has an adb_block_size of 4k and that an INITIALIZE assume that ADBi has an adb_block_size of 4k and that an INITIALIZE
starts 1k inside ADBi. The server should [[Comment.9: Need to flesh starts 1k inside ADBi. The server should [[Comment.10: Need to flesh
this out. --TH]] this out. --TH]]
11.7.3.1. Hole punching 12.7.3.1. Hole punching
Whenever a client wishes to deallocate the blocks backing a Whenever a client wishes to deallocate the blocks backing a
particular region in the file, it calls the INITIALIZE operation with particular region in the file, it calls the INITIALIZE operation with
the current filehandle set to the filehandle of the file in question, the current filehandle set to the filehandle of the file in question,
start offset and length in bytes of the region set in hpa_offset and start offset and length in bytes of the region set in hpa_offset and
hpa_count respectively. All further reads to this region MUST return hpa_count respectively. All further reads to this region MUST return
zeros until overwritten. The filehandle specified must be that of a zeros until overwritten. The filehandle specified must be that of a
regular file. regular file.
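The read-after-punch guarantee can be pictured with a toy in-memory model. The Python sketch below is only an analogy: the function name is hypothetical, the file is a bytearray rather than server storage, and clamping a region that runs past the end of the file is an assumption made for the example.

```python
def punch_hole(data: bytearray, hpa_offset: int, hpa_count: int) -> None:
    """Zero the region in place so later reads of it return zeros."""
    if hpa_offset < 0 or hpa_count < 0:
        raise ValueError("offset and count must be non-negative")
    # Assumption: a region extending past end of file is clamped to the file.
    end = min(hpa_offset + hpa_count, len(data))
    if hpa_offset < end:
        data[hpa_offset:end] = bytes(end - hpa_offset)
```

A real server would also deallocate the backing blocks. The model captures only what a reader of the region observes until the region is overwritten.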
Situations may arise where ia_hole.hi_offset and/or ia_hole.hi_offset Situations may arise where ia_hole.hi_offset and/or ia_hole.hi_offset
skipping to change at page 76, line 47 skipping to change at page 69, line 5
NFS4ERR_NOTSUPP The Hole punch operations are not supported by the NFS4ERR_NOTSUPP The Hole punch operations are not supported by the
NFS server receiving this request. NFS server receiving this request.
NFS4ERR_DIR The current filehandle is of type NF4DIR. NFS4ERR_DIR The current filehandle is of type NF4DIR.
NFS4ERR_SYMLINK The current filehandle is of type NF4LNK. NFS4ERR_SYMLINK The current filehandle is of type NF4LNK.
NFS4ERR_WRONG_TYPE The current filehandle does not designate an NFS4ERR_WRONG_TYPE The current filehandle does not designate an
ordinary file. ordinary file.
11.8. Operation 67: IO_ADVISE - Application I/O access pattern hints 12.8. Operation 67: IO_ADVISE - Application I/O access pattern hints
This section introduces a new operation, named IO_ADVISE, which This section introduces a new operation, named IO_ADVISE, which
allows NFS clients to communicate application I/O access pattern allows NFS clients to communicate application I/O access pattern
hints to the NFS server. This new operation will allow hints to be hints to the NFS server. This new operation will allow hints to be
sent to the server when applications use posix_fadvise, direct I/O, sent to the server when applications use posix_fadvise, direct I/O,
or at any other point the client finds useful. or at any other point the client finds useful.
11.8.1. ARGUMENT 12.8.1. ARGUMENT
enum IO_ADVISE_type4 { enum IO_ADVISE_type4 {
IO_ADVISE4_NORMAL = 0, IO_ADVISE4_NORMAL = 0,
IO_ADVISE4_SEQUENTIAL = 1, IO_ADVISE4_SEQUENTIAL = 1,
IO_ADVISE4_SEQUENTIAL_BACKWARDS = 2, IO_ADVISE4_SEQUENTIAL_BACKWARDS = 2,
IO_ADVISE4_RANDOM = 3, IO_ADVISE4_RANDOM = 3,
IO_ADVISE4_WILLNEED = 4, IO_ADVISE4_WILLNEED = 4,
IO_ADVISE4_WILLNEED_OPPORTUNISTIC = 5, IO_ADVISE4_WILLNEED_OPPORTUNISTIC = 5,
IO_ADVISE4_DONTNEED = 6, IO_ADVISE4_DONTNEED = 6,
IO_ADVISE4_NOREUSE = 7, IO_ADVISE4_NOREUSE = 7,
skipping to change at page 77, line 30 skipping to change at page 69, line 36
}; };
struct IO_ADVISE4args { struct IO_ADVISE4args {
/* CURRENT_FH: file */ /* CURRENT_FH: file */
stateid4 iar_stateid; stateid4 iar_stateid;
offset4 iar_offset; offset4 iar_offset;
length4 iar_count; length4 iar_count;
bitmap4 iar_hints; bitmap4 iar_hints;
}; };
11.8.2. RESULT 12.8.2. RESULT
struct IO_ADVISE4resok { struct IO_ADVISE4resok {
bitmap4 ior_hints; bitmap4 ior_hints;
}; };
union IO_ADVISE4res switch (nfsstat4 _status) { union IO_ADVISE4res switch (nfsstat4 _status) {
case NFS4_OK: case NFS4_OK:
IO_ADVISE4resok resok4; IO_ADVISE4resok resok4;
default: default:
void; void;
}; };
11.8.3. DESCRIPTION 12.8.3. DESCRIPTION
The IO_ADVISE operation sends an I/O access pattern hint to the The IO_ADVISE operation sends an I/O access pattern hint to the
server for the owner of the stateid for a given byte range specified by server for the owner of the stateid for a given byte range specified by
iar_offset and iar_count. The byte range specified by iar_offset and iar_offset and iar_count. The byte range specified by iar_offset and
iar_count need not currently exist in the file, but the iar_hints iar_count need not currently exist in the file, but the iar_hints
will apply to the byte range when it does exist. If iar_count is 0, will apply to the byte range when it does exist. If iar_count is 0,
all data following iar_offset is specified. The server MAY ignore all data following iar_offset is specified. The server MAY ignore
the advice. the advice.
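The argument handling described above can be sketched in Python as follows. The helper names are hypothetical. The sketch assumes that each IO_ADVISE_type4 value is used as a bit number in the iar_hints bitmap4, with bit n stored in 32-bit word n // 32, and it uses only the enum values visible in the ARGUMENT section.

```python
from typing import Iterable, List, Optional, Tuple

# Hint numbers copied from the IO_ADVISE_type4 enum (values 0..7 only).
IO_ADVISE4_NORMAL = 0
IO_ADVISE4_SEQUENTIAL = 1
IO_ADVISE4_WILLNEED = 4
IO_ADVISE4_NOREUSE = 7


def hints_to_bitmap4(hints: Iterable[int]) -> List[int]:
    """Pack hint numbers into bitmap4 words: bit n lives in word n // 32."""
    words: List[int] = []
    for n in hints:
        word, bit = divmod(n, 32)
        while len(words) <= word:
            words.append(0)
        words[word] |= 1 << bit
    return words


def advised_range(iar_offset: int, iar_count: int) -> Tuple[int, Optional[int]]:
    """Return (start, end); end is None when iar_count is 0 (to end of file)."""
    if iar_count == 0:
        return (iar_offset, None)
    return (iar_offset, iar_offset + iar_count)
```

For example, advising sequential access with an expected future read from offset 4096 onward would use hints_to_bitmap4([IO_ADVISE4_SEQUENTIAL, IO_ADVISE4_WILLNEED]) together with iar_offset 4096 and iar_count 0.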
The following are the possible hints: The following are the possible hints:
skipping to change at page 79, line 31 skipping to change at page 71, line 38
perhaps due to a temporary resource limitation. perhaps due to a temporary resource limitation.
Each issuance of the IO_ADVISE operation overrides all previous Each issuance of the IO_ADVISE operation overrides all previous
issuances of IO_ADVISE for a given byte range. This effectively issuances of IO_ADVISE for a given byte range. This effectively
follows a strategy of last hint wins for a given stateid and byte follows a strategy of last hint wins for a given stateid and byte
range. range.
Clients should assume that hints included in an IO_ADVISE operation Clients should assume that hints included in an IO_ADVISE operation
will be forgotten once the file is closed. will be forgotten once the file is closed.
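One way to picture the last-hint-wins rule and the forgetting of hints on close is a small table keyed by stateid and byte range. The Python sketch below is a toy model with hypothetical names. It matches byte ranges exactly and does not attempt to split or merge overlapping ranges, which a real server would have to decide how to handle.

```python
from typing import Dict, List, Tuple

Key = Tuple[bytes, int, int]  # (stateid, offset, count)


class HintTable:
    """Toy server-side table: the newest IO_ADVISE replaces the old one."""

    def __init__(self) -> None:
        self._hints: Dict[Key, List[int]] = {}

    def advise(self, stateid: bytes, offset: int, count: int,
               hints: List[int]) -> None:
        # Overwrite rather than merge: the last hint set wins.
        self._hints[(stateid, offset, count)] = list(hints)

    def lookup(self, stateid: bytes, offset: int, count: int) -> List[int]:
        return self._hints.get((stateid, offset, count), [])

    def close(self, stateid: bytes) -> None:
        """Drop every hint recorded for this stateid, as on file close."""
        for key in [k for k in self._hints if k[0] == stateid]:
            del self._hints[key]
```

Because the table overwrites instead of merging, a client that wants two hints in force for the same range would send both in a single IO_ADVISE.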
11.8.4. IMPLEMENTATION 12.8.4. IMPLEMENTATION
The NFS client may choose to issue an IO_ADVISE operation to the The NFS client may choose to issue an IO_ADVISE operation to the
server in several different instances. server in several different instances.
The most obvious is in direct response to an application's execution The most obvious is in direct response to an application's execution
of posix_fadvise. In this case, IO_ADVISE4_WRITE and IO_ADVISE4_READ of posix_fadvise. In this case, IO_ADVISE4_WRITE and IO_ADVISE4_READ
may be set based upon the type of file access specified when the file may be set based upon the type of file access specified when the file
was opened. was opened.
Another useful point would be when an application indicates it is Another useful point would be when an application indicates it is
using direct I/O. Direct I/O may be specified at file open, in which using direct I/O. Direct I/O may be specified at file open, in which
case an IO_ADVISE may be included in the same compound as the OPEN case an IO_ADVISE may be included in the same compound as the OPEN
operation with the IO_ADVISE4_NOREUSE flag set. Direct I/O may also operation with the IO_ADVISE4_NOREUSE flag set. Direct I/O may also
be specified separately, in which case an IO_ADVISE operation can be be specified separately, in which case an IO_ADVISE operation can be
sent to the server separately. As above, IO_ADVISE4_WRITE and sent to the server separately. As above, IO_ADVISE4_WRITE and
IO_ADVISE4_READ may be set based upon the type of file access IO_ADVISE4_READ may be set based upon the type of file access
specified when the file was opened. specified when the file was opened.
11.8.5. pNFS File Layout Data Type Considerations 12.8.5. pNFS File Layout Data Type Considerations
The IO_ADVISE considerations for pNFS are very similar to the COMMIT The IO_ADVISE considerations for pNFS are very similar to the COMMIT
considerations for pNFS. That is, as with COMMIT, some NFS server considerations for pNFS. That is, as with COMMIT, some NFS server
implementations prefer IO_ADVISE be done on the DS, and some prefer implementations prefer IO_ADVISE be done on the DS, and some prefer
it be done on the MDS. it be done on the MDS.
So for the file's layout type, it is proposed that NFSv4.2 include an So for the file's layout type, it is proposed that NFSv4.2 include an
additional hint NFL42_CARE_IO_ADVISE_THRU_MDS which is valid only on additional hint NFL42_CARE_IO_ADVISE_THRU_MDS which is valid only on
NFSv4.2 or higher. Any file's layout obtained with NFSv4.1 MUST NOT NFSv4.2 or higher. Any file's layout obtained with NFSv4.1 MUST NOT
have NFL42_UFLG_IO_ADVISE_THRU_MDS set. Any file's layout obtained have NFL42_UFLG_IO_ADVISE_THRU_MDS set. Any file's layout obtained
skipping to change at page 80, line 38 skipping to change at page 72, line 42
client's intended use of the file, then the client SHOULD send an client's intended use of the file, then the client SHOULD send an
IO_ADVISE to each DS. While the client MAY always send IO_ADVISE to IO_ADVISE to each DS. While the client MAY always send IO_ADVISE to
the MDS, if the server has not set NFL42_UFLG_IO_ADVISE_THRU_MDS, the the MDS, if the server has not set NFL42_UFLG_IO_ADVISE_THRU_MDS, the
client should expect that such an IO_ADVISE is futile. Note that a client should expect that such an IO_ADVISE is futile. Note that a
client SHOULD use the same set of arguments on each IO_ADVISE sent to client SHOULD use the same set of arguments on each IO_ADVISE sent to
a DS for the same open file reference. a DS for the same open file reference.
The server is not required to support different advice for different The server is not required to support different advice for different
DS's with the same open file reference. DS's with the same open file reference.
11.8.5.1. Dense and Sparse Packing Considerations 12.8.5.1. Dense and Sparse Packing Considerations
The IO_ADVISE operation MUST use the iar_offset and byte range as The IO_ADVISE operation MUST use the iar_offset and byte range as
dictated by the presence or absence of NFL4_UFLG_DENSE. dictated by the presence or absence of NFL4_UFLG_DENSE.
E.g., if NFL4_UFLG_DENSE is present, and a READ or WRITE to the DS E.g., if NFL4_UFLG_DENSE is present, and a READ or WRITE to the DS
for iar_offset 0 really means iar_offset 10000 in the logical file, for iar_offset 0 really means iar_offset 10000 in the logical file,
then an IO_ADVISE for iar_offset 0 means iar_offset 10000. then an IO_ADVISE for iar_offset 0 means iar_offset 10000.
E.g., if NFL4_UFLG_DENSE is absent, then a READ or WRITE to the DS E.g., if NFL4_UFLG_DENSE is absent, then a READ or WRITE to the DS
for iar_offset 0 really means iar_offset 0 in the logical file, then for iar_offset 0 really means iar_offset 0 in the logical file, then
skipping to change at page 82, line 8 skipping to change at page 74, line 14
If neither of the flags NFL42_UFLG_IO_ADVISE_THRU_MDS and If neither of the flags NFL42_UFLG_IO_ADVISE_THRU_MDS and
NFL4_UFLG_DENSE are set in the layout, then any IO_ADVISE request NFL4_UFLG_DENSE are set in the layout, then any IO_ADVISE request
sent to the data server with a byte range that overlaps a stripe unit sent to the data server with a byte range that overlaps a stripe unit
that the data server does not serve MUST NOT result in the status that the data server does not serve MUST NOT result in the status
NFS4ERR_PNFS_IO_HOLE. Instead, the response SHOULD be successful and NFS4ERR_PNFS_IO_HOLE. Instead, the response SHOULD be successful and
if the server applies IO_ADVISE hints on any stripe units that if the server applies IO_ADVISE hints on any stripe units that
overlap with the specified range, those hints SHOULD be indicated in overlap with the specified range, those hints SHOULD be indicated in
the response. the response.
11.8.6. Number of Supported File Segments 12.8.6. Number of Supported File Segments
In theory IO_ADVISE allows a client and server to support multiple In theory IO_ADVISE allows a client and server to support multiple
file segments, meaning that different, possibly overlapping, byte file segments, meaning that different, possibly overlapping, byte
ranges of the same open file reference will support different hints. ranges of the same open file reference will support different hints.
This is not practical, and in general the server will support just This is not practical, and in general the server will support just
one set of hints, and these will apply to the entire file. However, one set of hints, and these will apply to the entire file. However,
there are some hints that are very ephemeral, and essentially amount there are some hints that are very ephemeral, and essentially amount
to one-time instructions to the NFS server, which will be forgotten to one-time instructions to the NFS server, which will be forgotten
momentarily after IO_ADVISE is executed. momentarily after IO_ADVISE is executed.
skipping to change at page 83, line 5 skipping to change at page 75, line 9
o IO_ADVISE4_NOREUSE o IO_ADVISE4_NOREUSE
The following hints are modifiers to all other hints, and will apply The following hints are modifiers to all other hints, and will apply
to the entire file and/or to a one time instruction on the specified to the entire file and/or to a one time instruction on the specified
byte range: byte range:
o IO_ADVISE4_READ o IO_ADVISE4_READ
o IO_ADVISE4_WRITE o IO_ADVISE4_WRITE
11.8.7. Possible Additional Hint - IO_ADVISE4_RECENTLY_USED 12.8.7. Possible Additional Hint - IO_ADVISE4_RECENTLY_USED
IO_ADVISE4_RECENTLY_USED The client has recently accessed the byte IO_ADVISE4_RECENTLY_USED The client has recently accessed the byte
range in its own cache. This informs the server that the data in range in its own cache. This informs the server that the data in
the byte range remains important to the client. When the server the byte range remains important to the client. When the server
reaches resource exhaustion, knowing which data is more important reaches resource exhaustion, knowing which data is more important
allows the server to make better choices about which data to, for allows the server to make better choices about which data to, for
example, purge from a cache or move to secondary storage. It also example, purge from a cache or move to secondary storage. It also
informs the server which delegations are more important, since if informs the server which delegations are more important, since if
delegations are working correctly, once delegated to a client, a delegations are working correctly, once delegated to a client, a
server might never receive another I/O request for the file. server might never receive another I/O request for the file.
skipping to change at page 83, line 42 skipping to change at page 75, line 46
unclear. For example, as most clients already cache data that they unclear. For example, as most clients already cache data that they
know is important, having this data cached twice may be unnecessary. know is important, having this data cached twice may be unnecessary.
In fact, substantial performance improvements have been demonstrated In fact, substantial performance improvements have been demonstrated
by making caches more exclusive between each other [25], not the by making caches more exclusive between each other [25], not the
other way around. This means that there is a strong argument to be other way around. This means that there is a strong argument to be
made that servers should immediately purge the described cached data made that servers should immediately purge the described cached data
upon receiving this hint. Other work showed that even infinite sized upon receiving this hint. Other work showed that even infinite sized
secondary caches can be largely ineffective [26], but this of course secondary caches can be largely ineffective [26], but this of course
is subject to the workload. is subject to the workload.
11.9. Changes to Operation 51: LAYOUTRETURN 12.9. Changes to Operation 51: LAYOUTRETURN
11.9.1. Introduction 12.9.1. Introduction
In the pNFS description provided in [2], the client is not enabled to In the pNFS description provided in [2], the client is not enabled to
relay an error code from the DS to the MDS. In the specification of relay an error code from the DS to the MDS. In the specification of
the Objects-Based Layout protocol [8], use is made of the opaque the Objects-Based Layout protocol [8], use is made of the opaque
lrf_body field of the LAYOUTRETURN argument to do such a relaying of lrf_body field of the LAYOUTRETURN argument to do such a relaying of
error codes. In this section, we define a new data structure to error codes. In this section, we define a new data structure to
enable the passing of error codes back to the MDS and provide some enable the passing of error codes back to the MDS and provide some
guidelines on what both the client and MDS should expect in such guidelines on what both the client and MDS should expect in such
circumstances. circumstances.
skipping to change at page 84, line 25 skipping to change at page 76, line 29
hard error. The MDS on the other hand, is waiting for the client to hard error. The MDS on the other hand, is waiting for the client to
report such an error. For it, the mission is accomplished in that report such an error. For it, the mission is accomplished in that
the client has returned a layout that the MDS had most likely the client has returned a layout that the MDS had most likely
recalled. recalled.
The existing LAYOUTRETURN operation is extended by introducing a new The existing LAYOUTRETURN operation is extended by introducing a new
data structure to report errors, layoutreturn_device_error4. Also, data structure to report errors, layoutreturn_device_error4. Also,
layoutreturn_error_report4 is introduced to enable an array of errors layoutreturn_error_report4 is introduced to enable an array of errors
to be reported. to be reported.
11.9.2. ARGUMENT 12.9.2. ARGUMENT
The ARGUMENT specification of the LAYOUTRETURN operation in section The ARGUMENT specification of the LAYOUTRETURN operation in section
18.44.1 of [2] is augmented by the following XDR code [24]: 18.44.1 of [2] is augmented by the following XDR code [24]:
struct layoutreturn_device_error4 { struct layoutreturn_device_error4 {
deviceid4 lrde_deviceid; deviceid4 lrde_deviceid;
nfsstat4 lrde_status; nfsstat4 lrde_status;
nfs_opnum4 lrde_opnum; nfs_opnum4 lrde_opnum;
}; };
struct layoutreturn_error_report4 { struct layoutreturn_error_report4 {
layoutreturn_device_error4 lrer_errors<>; layoutreturn_device_error4 lrer_errors<>;
}; };
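As an illustration only, the two structures above could be serialized
along the following lines. This is a minimal sketch, assuming the
16-byte deviceid4 and the enum values for NFS4ERR_NXIO and OP_READ
defined in [2]; the function names are hypothetical, and how the
encoded report is carried inside lrf_body is not shown.

```python
import struct

NFS4_DEVICEID4_SIZE = 16   # size of deviceid4 in the NFSv4.1 XDR
NFS4ERR_NXIO = 6           # nfsstat4 value from NFSv4.1
OP_READ = 25               # nfs_opnum4 value from NFSv4.1


def encode_device_error(deviceid: bytes, status: int, opnum: int) -> bytes:
    """XDR-encode one layoutreturn_device_error4 (hypothetical helper)."""
    if len(deviceid) != NFS4_DEVICEID4_SIZE:
        raise ValueError("deviceid4 must be exactly 16 bytes")
    # Fixed-length opaque (already a multiple of 4, so no padding), then
    # two XDR enums, each a 4-byte big-endian signed integer.
    return deviceid + struct.pack(">ii", status, opnum)


def encode_error_report(errors) -> bytes:
    """XDR-encode a layoutreturn_error_report4 (hypothetical helper)."""
    # A variable-length array is a 4-byte unsigned element count
    # followed by the encoded elements.
    body = b"".join(encode_device_error(*e) for e in errors)
    return struct.pack(">I", len(errors)) + body


# One error: the client could not reach the DS while attempting a READ.
report = encode_error_report([(bytes(16), NFS4ERR_NXIO, OP_READ)])
```

Each encoded error occupies 24 bytes, so a report with a single entry
is 28 bytes long including the array count.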
11.9.3. RESULT 12.9.3. RESULT
The RESULT of the LAYOUTRETURN operation is unchanged; see section The RESULT of the LAYOUTRETURN operation is unchanged; see section
18.44.2 of [2]. 18.44.2 of [2].
11.9.4. DESCRIPTION 12.9.4. DESCRIPTION
The following text is added to the end of the LAYOUTRETURN operation The following text is added to the end of the LAYOUTRETURN operation
DESCRIPTION in section 18.44.3 of [2]. DESCRIPTION in section 18.44.3 of [2].
When a client uses LAYOUTRETURN with a type of LAYOUTRETURN4_FILE, When a client uses LAYOUTRETURN with a type of LAYOUTRETURN4_FILE,
then if the lrf_body field is NULL, it indicates to the MDS that the then if the lrf_body field is NULL, it indicates to the MDS that the
client experienced no errors. If lrf_body is non-NULL, then the client experienced no errors. If lrf_body is non-NULL, then the
field references error information which is layout type specific. field references error information which is layout type specific.
I.e., the Objects-Based Layout protocol can continue to utilize I.e., the Objects-Based Layout protocol can continue to utilize
lrf_body as specified in [8]. For both Files-Based Layouts, the lrf_body as specified in [8]. For both Files-Based Layouts, the
skipping to change at page 85, line 26 skipping to change at page 77, line 34
NFS4_OK: No issues were found for this device. NFS4_OK: No issues were found for this device.
NFS4ERR_NXIO: The client was unable to establish any communication NFS4ERR_NXIO: The client was unable to establish any communication
with the DS. with the DS.
NFS4ERR_*: The client was able to establish communication with the NFS4ERR_*: The client was able to establish communication with the
DS and is returning one of the allowed error codes for the DS and is returning one of the allowed error codes for the
operation denoted by lrde_opnum. operation denoted by lrde_opnum.
11.9.5. IMPLEMENTATION 12.9.5. IMPLEMENTATION
The following text is added to the end of the LAYOUTRETURN operation The following text is added to the end of the LAYOUTRETURN operation
IMPLEMENTATION in section 18.44.4 of [2]. IMPLEMENTATION in section 18.44.4 of [2].
A client that expects to use pNFS for a mounted filesystem SHOULD A client that expects to use pNFS for a mounted filesystem SHOULD
check for pNFS support at mount time. This check SHOULD be performed check for pNFS support at mount time. This check SHOULD be performed
by sending a GETDEVICELIST operation, followed by layout-type- by sending a GETDEVICELIST operation, followed by layout-type-
specific checks for accessibility of each storage device returned by specific checks for accessibility of each storage device returned by
GETDEVICELIST. If the NFS server does not support pNFS, the GETDEVICELIST. If the NFS server does not support pNFS, the
GETDEVICELIST operation will be rejected with an NFS4ERR_NOTSUPP GETDEVICELIST operation will be rejected with an NFS4ERR_NOTSUPP
skipping to change at page 86, line 9 skipping to change at page 78, line 16
When an I/O fails to a storage device, the client SHOULD retry the When an I/O fails to a storage device, the client SHOULD retry the
failed I/O via the MDS. In this situation, before retrying the I/O, failed I/O via the MDS. In this situation, before retrying the I/O,
the client SHOULD return the layout, or the affected portion thereof, the client SHOULD return the layout, or the affected portion thereof,
and SHOULD indicate which storage device or devices were problematic. and SHOULD indicate which storage device or devices were problematic.
If the client does not do this, the MDS may issue a layout recall If the client does not do this, the MDS may issue a layout recall
callback in order to perform the retried I/O. callback in order to perform the retried I/O.
The client needs to be cognizant that since this error handling is The client needs to be cognizant that since this error handling is
optional in the MDS, the MDS may silently ignore this functionality. optional in the MDS, the MDS may silently ignore this functionality.
Also, as the MDS may consider some issues the client reports to be Also, as the MDS may consider some issues the client reports to be
expected (see Section 11.9.1), the client might find it difficult to expected (see Section 12.9.1), the client might find it difficult to
detect an MDS which has not implemented error handling via detect an MDS which has not implemented error handling via
LAYOUTRETURN. LAYOUTRETURN.
If an MDS is aware that a storage device is proving problematic to a If an MDS is aware that a storage device is proving problematic to a
client, the MDS SHOULD NOT include that storage device in any pNFS client, the MDS SHOULD NOT include that storage device in any pNFS
layouts sent to that client. If the MDS is aware that a storage layouts sent to that client. If the MDS is aware that a storage
device is affecting many clients, then the MDS SHOULD NOT include device is affecting many clients, then the MDS SHOULD NOT include
that storage device in any pNFS layouts sent out. Clients must still that storage device in any pNFS layouts sent out. Clients must still
be aware that the MDS might not have any choice in using the storage be aware that the MDS might not have any choice in using the storage
device, i.e., there might only be one possible layout for the system. device, i.e., there might only be one possible layout for the system.
skipping to change at page 86, line 38 skipping to change at page 78, line 45
using the problematic storage devices in layouts for that client, but using the problematic storage devices in layouts for that client, but
the MDS is not required to indefinitely retain per-client storage the MDS is not required to indefinitely retain per-client storage
device error information. An MDS is also not required to device error information. An MDS is also not required to
automatically reinstate use of a previously problematic storage automatically reinstate use of a previously problematic storage
device; administrative intervention may be required instead. device; administrative intervention may be required instead.
A client MAY perform I/O via the MDS even when the client holds a A client MAY perform I/O via the MDS even when the client holds a
layout that covers the I/O; servers MUST support this client layout that covers the I/O; servers MUST support this client
behavior, and MAY recall layouts as needed to complete I/Os. behavior, and MAY recall layouts as needed to complete I/Os.
11.10. Operation 65: READ_PLUS 12.10. Operation 65: READ_PLUS
READ_PLUS is a new read operation which allows NFS clients to avoid
reading holes in a sparse file and to efficiently transfer ADBs.
READ_PLUS is guaranteed to perform no worse than READ, and can
dramatically improve performance with sparse files.
READ_PLUS supports all the features of the existing NFSv4.1 READ
operation [2] and adds a simple yet significant extension to the
format of its response. The change allows the server to avoid
returning data for portions of the file which are either initialized
and contain no backing store or if the result would appear to be so.
I.e., if the result was a data block composed entirely of zeros, then
it is easier to return a hole. Returning data blocks of uninitialized
data wastes computational and network resources, thus reducing
performance. READ_PLUS uses a new result structure that tells the
client that the result is all zeroes AND the byte-range of the hole
in which the request was made.
If the client sends a READ operation, it is explicitly stating that If the client sends a READ operation, it is explicitly stating that
it is not supporting sparse files. So if a READ occurs on a sparse it is neither supporting sparse files nor ADBs. So if a READ occurs
ADB, then the server must expand such ADBs to be raw bytes. If a on a sparse ADB or file, then the server must expand such data to be
READ occurs in the middle of an ADB, the server can only send back raw bytes. If a READ occurs in the middle of a hole or ADB, the
bytes starting from that offset. server can only send back bytes starting from that offset.
Such an operation is inefficient for transfer of sparse sections of Such an operation is inefficient for transfer of sparse sections of
the file. As such, READ is marked as OBSOLETE in NFSv4.2. Instead, the file. As such, READ is marked as OBSOLETE in NFSv4.2. Instead,
a client should issue READ_PLUS. Note that as the client has no a a client should issue READ_PLUS. Note that as the client has no a
priori knowledge of whether an ADB is present or not, it should priori knowledge of whether an ADB is present or not, it should
always use READ_PLUS. always use READ_PLUS.
11.10.1. ARGUMENT 12.10.1. ARGUMENT
struct READ_PLUS4args { struct READ_PLUS4args {
/* CURRENT_FH: file */ /* CURRENT_FH: file */
stateid4 rpa_stateid; stateid4 rpa_stateid;
offset4 rpa_offset; offset4 rpa_offset;
count4 rpa_count; count4 rpa_count;
}; };
11.10.2. RESULT 12.10.2. RESULT
union read_plus_content switch (data_content4 content) { union read_plus_content switch (data_content4 content) {
case NFS4_CONTENT_DATA: case NFS4_CONTENT_DATA:
opaque rpc_data<>; opaque rpc_data<>;
case NFS4_CONTENT_APP_BLOCK: case NFS4_CONTENT_APP_BLOCK:
app_data_block4 rpc_block; app_data_block4 rpc_block;
case NFS4_CONTENT_HOLE: case NFS4_CONTENT_HOLE:
data_info4 rpc_hole; data_info4 rpc_hole;
default: default:
void; void;
skipping to change at page 87, line 42 skipping to change at page 80, line 33
read_plus_content rpr_contents<>; read_plus_content rpr_contents<>;
}; };
union READ_PLUS4res switch (nfsstat4 status) { union READ_PLUS4res switch (nfsstat4 status) {
case NFS4_OK: case NFS4_OK:
read_plus_res4 resok4; read_plus_res4 resok4;
default: default:
void; void;
}; };
11.10.3. DESCRIPTION 12.10.3. DESCRIPTION
Over the given range, READ_PLUS will return all data and ADBs found The READ_PLUS operation is based upon the NFSv4.1 READ operation [2],
as an array of read_plus_content. It is possible to have consecutive and similarly reads data from the regular file identified by the
ADBs in the array as either different definitions of ADBs are present current filehandle.
or as the guard pattern changes.
Edge cases exist for ADBs which either begin before the rpa_offset The client provides a rpa_offset of where the READ_PLUS is to start
requested by the READ_PLUS or end after the rpa_count requested - and a rpa_count of how many bytes are to be read. A rpa_offset of
both of which may occur as not all applications which access the file zero means to read data starting at the beginning of the file. If
are aware of the main application imposing a format on the file rpa_offset is greater than or equal to the size of the file, the
contents, i.e., tar, dd, cp, etc. READ_PLUS MUST retrieve whole status NFS4_OK is returned with di_length (the data length) set to
ADBs, but it need not retrieve an entire sequence of ADBs. zero and eof set to TRUE. READ_PLUS is subject to access permissions
checking.
The server MUST return a whole ADB because if it does not, it must The READ_PLUS result is comprised of an array of rpr_contents, each
expand that partial ADB before it sends it to the client. E.g., if of which describes a data_content4 type of data. For NFSv4.2, the
an ADB had a block size of 64k and the READ_PLUS was for 128k allowed values are data, ADB, and hole. A server is required to
starting at an offset of 32k inside the ADB, then the first 32k would support the data type, but not ADB nor hole. Both an ADB and a hole
be converted to data. must be returned in its entirety - clients must be prepared to get
more information than they requested.
11.11. Operation 66: SEEK If the data to be returned is comprised entirely of zeros, then the
server may elect to return that data as a hole. The server
differentiates this to the client by setting di_allocated to TRUE in
this case. Note that in such a scenario, the server is not required
to determine the full extent of the "hole" - it does not need to
determine where the zeros start and end.
XXX The server may elect to return adjacent elements of the same type.
For example, the guard pattern or block size of an ADB might change,
which would require adjacent elements of type ADB. Likewise if the
server has a range of data comprised entirely of zeros and then a
hole, it might want to return two adjacent holes to the client.
11.11.1. ARGUMENT If the client specifies a rpa_count value of zero, the READ_PLUS
succeeds and returns zero bytes of data, again subject to access
permissions checking. In all situations, the server may choose to
return fewer bytes than specified by the client. The client needs to
check for this condition and handle the condition appropriately.
If the client specifies an rpa_offset and rpa_count value that is
entirely contained within a hole of the file, then the di_offset and
di_length returned must be for the entire hole. This result is
considered valid until the file is changed (detected via the change
attribute). The server MUST provide the same semantics for the hole
as if the client read the region and received zeroes; the implied
hole's contents lifetime MUST be exactly the same as any other read
data.
If the client specifies an rpa_offset and rpa_count value that begins
in a non-hole of the file but extends into a hole, the server should
return an array comprised of both data and a hole. The client MUST
be prepared for the server to return a short read describing just the
data. The client will then issue another READ_PLUS for the remaining
bytes, to which the server will respond with information about the hole
in the file.
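The handling of data and holes described above can be sketched as
follows. This is an illustration under stated assumptions, not a
server implementation: the file is modeled as a sorted list of
(kind, offset, length) segments, the function name is hypothetical,
ADBs and short reads are ignored, and a zero rpa_count is taken to
yield an empty array.

```python
from typing import List, Tuple

Segment = Tuple[str, int, int]  # (kind, offset, length); kind is "data" or "hole"


def read_plus_contents(segments: List[Segment], rpa_offset: int,
                       rpa_count: int) -> List[Segment]:
    """Return the elements that overlap the requested byte range.

    Data elements are clipped to the request.  Hole elements are
    reported in their entirety, so they may begin before rpa_offset or
    end after rpa_offset + rpa_count.
    """
    if rpa_count == 0:
        return []  # assumption: a zero-length read describes nothing
    end = rpa_offset + rpa_count
    contents = []
    for kind, seg_off, seg_len in segments:
        seg_end = seg_off + seg_len
        if seg_end <= rpa_offset or seg_off >= end:
            continue  # no overlap with the requested range
        if kind == "hole":
            contents.append((kind, seg_off, seg_len))
        else:
            start = max(seg_off, rpa_offset)
            contents.append((kind, start, min(seg_end, end) - start))
    return contents


# Hypothetical file: 100 bytes of data followed by a 400-byte hole.
layout = [("data", 0, 100), ("hole", 100, 400)]
mixed = read_plus_contents(layout, 50, 100)        # data, then the whole hole
inside_hole = read_plus_contents(layout, 200, 50)  # the whole hole only
```

In both calls the hole is described as starting at offset 100 with a
length of 400, even though the client asked for a smaller range.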
Except when special stateids are used, the stateid value for a
READ_PLUS request represents a value returned from a previous byte-
range lock or share reservation request or the stateid associated
with a delegation. The stateid identifies the associated owners if
any and is used by the server to verify that the associated locks are
still valid (e.g., have not been revoked).
If the read ended at the end-of-file (formally, in a correctly formed
READ_PLUS operation, if rpa_offset + rpa_count is equal to the size
of the file), or the READ_PLUS operation extends beyond the size of
the file (if rpa_offset + rpa_count is greater than the size of the
file), eof is returned as TRUE; otherwise, it is FALSE. A successful
READ_PLUS of an empty file will always return eof as TRUE.
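The eof rule in the preceding paragraph reduces to a single
comparison. The helper below is a hypothetical sketch of that rule
only; it assumes a correctly formed request and does not model access
checks or short reads.

```python
def read_plus_eof(file_size: int, rpa_offset: int, rpa_count: int) -> bool:
    """Return TRUE when the requested range reaches or passes end-of-file."""
    return rpa_offset + rpa_count >= file_size


at_end = read_plus_eof(file_size=4096, rpa_offset=0, rpa_count=4096)
before_end = read_plus_eof(file_size=4096, rpa_offset=0, rpa_count=1024)
empty_file = read_plus_eof(file_size=0, rpa_offset=0, rpa_count=0)
```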
If the current filehandle is not an ordinary file, an error will be
returned to the client. In the case that the current filehandle
represents an object of type NF4DIR, NFS4ERR_ISDIR is returned. If
the current filehandle designates a symbolic link, NFS4ERR_SYMLINK is
returned. In all other cases, NFS4ERR_WRONG_TYPE is returned.
For a READ_PLUS with a stateid value of all bits equal to zero, the
server MAY allow the READ_PLUS to be serviced subject to mandatory
byte-range locks or the current share deny modes for the file. For a
READ_PLUS with a stateid value of all bits equal to one, the server
MAY allow READ_PLUS operations to bypass locking checks at the
server.
On success, the current filehandle retains its value.
12.10.4. IMPLEMENTATION
If the server returns a short read, then the client should send
another READ_PLUS to get the remaining data. A server may return
less data than requested under several circumstances. The file may
have been truncated by another client or perhaps on the server
itself, changing the file size from what the requesting client
believes to be the case. This would reduce the actual amount of data
available to the client. It is possible that the server reduced the
transfer size and so returned a short read result. Server resource
exhaustion may also result in a short read.
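A client-side loop for short reads might look like the sketch below.
Everything here is hypothetical: send_read_plus stands in for issuing
one READ_PLUS and decoding its reply into an eof flag plus a list of
(kind, offset, length) elements, and fake_server is a stand-in that
never returns more than 10 bytes per call.

```python
from typing import Callable, List, Tuple

Element = Tuple[str, int, int]       # (kind, offset, length)
Reply = Tuple[bool, List[Element]]   # (eof, elements)


def read_range(send_read_plus: Callable[[int, int], Reply],
               offset: int, count: int) -> List[Element]:
    """Collect elements for [offset, offset + count), reissuing
    READ_PLUS after every short read."""
    end = offset + count
    collected: List[Element] = []
    while offset < end:
        eof, elements = send_read_plus(offset, end - offset)
        collected.extend(elements)
        progressed = False
        if elements:
            # Resume after the last byte described by the reply; a hole
            # may extend past what was asked for.
            last = max(off + length for _, off, length in elements)
            progressed = last > offset
            offset = max(offset, last)
        if eof or not progressed:
            break  # done, or the server made no progress
    return collected


def fake_server(offset: int, count: int) -> Reply:
    """Stand-in for a server holding 25 bytes of data, 10 bytes per reply."""
    size = 25
    n = max(0, min(count, 10, size - offset))
    return (offset + n >= size, [("data", offset, n)] if n else [])


result = read_range(fake_server, 0, 25)
```

The loop stops when the reply reports eof or makes no forward
progress, which guards against spinning on an empty reply.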
If mandatory byte-range locking is in effect for the file, and if the
byte-range corresponding to the data to be read from the file is
WRITE_LT locked by an owner not associated with the stateid, the
server will return the NFS4ERR_LOCKED error. The client should try
to get the appropriate READ_LT via the LOCK operation before re-
attempting the READ_PLUS. When the READ_PLUS completes, the client
should release the byte-range lock via LOCKU. In addition, the
server MUST return an array of rpr_contents with values that are
within the owner's locked byte range.
If another client has an OPEN_DELEGATE_WRITE delegation for the file
being read, the delegation must be recalled, and the operation cannot
proceed until that delegation is returned or revoked. Except where
this happens very quickly, one or more NFS4ERR_DELAY errors will be
returned to requests made while the delegation remains outstanding.
Normally, delegations will not be recalled as a result of a READ_PLUS
operation since the recall will occur as a result of an earlier OPEN.
However, since it is possible for a READ_PLUS to be done with a
special stateid, the server needs to check for this case even though
the client should have done an OPEN previously.
12.10.4.1. Additional pNFS Implementation Information
[[Comment.11: We need to go over this section. --TH]] With pNFS, the
semantics of using READ_PLUS remains the same. Any data server MAY
return a READ_HOLE result for a READ_PLUS request that it receives.
When a data server chooses to return a READ_HOLE result, it has the
option of returning hole information for the data stored on that data
server (as defined by the data layout), but it MUST NOT return a
nfs_readplusreshole structure with a byte range that includes data
managed by another data server.
1. Data servers that cannot determine hole information SHOULD return
HOLE_NOINFO.
2. Data servers that can obtain hole information for the parts of
the file stored on that data server SHOULD return HOLE_INFO and
the byte range of the hole stored on that data server.
A data server should do its best to return as much information about
a hole as is feasible without having to contact the metadata server.
If communication with the metadata server is required, then every
attempt should be taken to minimize the number of requests.
If mandatory locking is enforced, then the data server must also
ensure that it returns only information for a hole that is within the
owner's locked byte range.
12.10.5. READ_PLUS with Sparse Files Example
The following table describes a sparse file. For each byte range,
the file contains either non-zero data or a hole. In addition, the
server in this example uses a Hole Threshold of 32K.
+-------------+----------+
| Byte-Range | Contents |
+-------------+----------+
| 0-15999 | Hole |
| 16K-31999 | Non-Zero |
| 32K-255999 | Hole |
| 256K-287999 | Non-Zero |
| 288K-353999 | Hole |
| 354K-417999 | Non-Zero |
+-------------+----------+
Table 3
Under the given circumstances, if a client were to read the file from
beginning to end with a max read size of 64K, the following will be
the result. This assumes the client has already opened the file,
acquired a valid stateid ('s' in the example), and just needs to
issue READ_PLUS requests. [[Comment.12: Change the results to match
array results. --TH]]
1. READ_PLUS(s, 0, 64K) --> NFS_OK, eof = false, data<>[32K].
Return a short read, as the last half of the request was all
zeroes. Note that the first hole is read back as all zeros as it
is below the Hole Threshold.
2. READ_PLUS(s, 32K, 64K) --> NFS_OK, readplusrestype4 = READ_HOLE,
nfs_readplusreshole(HOLE_INFO)(32K, 224K). The requested range
was all zeros, and the current hole begins at offset 32K and is
224K in length.
3. READ_PLUS(s, 256K, 64K) --> NFS_OK, readplusrestype4 = READ_OK,
eof = false, data<>[32K]. Return a short read, as the last half
of the request was all zeroes.
4. READ_PLUS(s, 288K, 64K) --> NFS_OK, readplusrestype4 = READ_HOLE,
nfs_readplusreshole(HOLE_INFO)(288K, 66K).
5. READ_PLUS(s, 354K, 64K) --> NFS_OK, readplusrestype4 = READ_OK,
eof = true, data<>[64K].
12.11. Operation 66: SEEK
SEEK is an operation that allows a client to determine the location
of the next data_content4 in a file.
12.11.1. ARGUMENT
struct SEEK4args { struct SEEK4args {
/* CURRENT_FH: file */ /* CURRENT_FH: file */
stateid4 sa_stateid; stateid4 sa_stateid;
offset4 sa_offset; offset4 sa_offset;
count4 sa_count; data_content4 sa_what;
}; };
11.11.2. RESULT 12.11.2. RESULT
union seek_content switch (data_content4 content) { union seek_content switch (data_content4 content) {
case NFS4_CONTENT_DATA: case NFS4_CONTENT_DATA:
data_info4 sc_data; data_info4 sc_data;
case NFS4_CONTENT_APP_BLOCK: case NFS4_CONTENT_APP_BLOCK:
app_data_block4 sc_block; app_data_block4 sc_block;
case NFS4_CONTENT_HOLE: case NFS4_CONTENT_HOLE:
data_info4 sc_hole; data_info4 sc_hole;
default: default:
void; void;
}; };
/*
* Allow a return of an array of contents.
*/
struct seek_res4 { struct seek_res4 {
bool sr_eof; bool sr_eof;
seek_content sr_contents; seek_content sr_contents;
}; };
union SEEK4res switch (nfsstat4 status) { union SEEK4res switch (nfsstat4 status) {
case NFS4_OK: case NFS4_OK:
seek_res4 resok4; seek_res4 resok4;
default: default:
void; void;
}; };
11.11.3. DESCRIPTION 12.11.3. DESCRIPTION
Over the given range, SEEK will return a range for all data, holes, From the given sa_offset, find the next data_content4 of type sa_what
and ADBs found as an array of seek_content. It does not return in the file. For either a hole or ADB, this must return the
data_content4 in its entirety. For data, it must not return the
actual data. actual data.
12. NFSv4.2 Callback Operations SEEK must follow the same rules for stateids as READ_PLUS
(Section 12.10.3).
12.1. Procedure 16: CB_ATTR_CHANGED - Notify Client that the File's If the server could not find a corresponding sa_what, then the status
would still be NFS4_OK, but sr_eof would be TRUE. The sr_contents
would contain a zeroed-out content of the appropriate type.
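The SEEK behavior described above can be sketched as a search over a
sorted list of (kind, offset, length) segments. This is an
illustration under assumptions: the function name and the tuple
result are hypothetical, sr_eof is assumed to be FALSE when a match
is found, and a segment that contains sa_offset counts as the "next"
one.

```python
from typing import List, Tuple

Segment = Tuple[str, int, int]  # (kind, offset, length)


def seek(segments: List[Segment], sa_offset: int,
         sa_what: str) -> Tuple[bool, Segment]:
    """Return (sr_eof, content) for the next element of type sa_what."""
    for kind, seg_off, seg_len in segments:
        if kind == sa_what and seg_off + seg_len > sa_offset:
            # Report the element in its entirety; no file data is returned.
            return (False, (kind, seg_off, seg_len))
    # Nothing found: still a success, with sr_eof TRUE and zeroed content.
    return (True, (sa_what, 0, 0))


# Hypothetical file: hole, data, hole.
layout = [("hole", 0, 100), ("data", 100, 50), ("hole", 150, 50)]
next_hole = seek(layout, 120, "hole")
next_data = seek(layout, 0, "data")
no_more_data = seek(layout, 180, "data")
```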
13. NFSv4.2 Callback Operations
13.1. Procedure 16: CB_ATTR_CHANGED - Notify Client that the File's
Attributes Changed Attributes Changed
12.1.1. ARGUMENTS 13.1.1. ARGUMENTS
struct CB_ATTR_CHANGED4args { struct CB_ATTR_CHANGED4args {
nfs_fh4 acca_fh; nfs_fh4 acca_fh;
bitmap4 acca_critical; bitmap4 acca_critical;
bitmap4 acca_info; bitmap4 acca_info;
}; };
12.1.2. RESULTS 13.1.2. RESULTS
struct CB_ATTR_CHANGED4res { struct CB_ATTR_CHANGED4res {
nfsstat4 accr_status; nfsstat4 accr_status;
}; };
12.1.3. DESCRIPTION 13.1.3. DESCRIPTION
The CB_ATTR_CHANGED callback operation is used by the server to The CB_ATTR_CHANGED callback operation is used by the server to
indicate to the client that the file's attributes have been modified indicate to the client that the file's attributes have been modified
on the server. The server does not convey how the attributes have on the server. The server does not convey how the attributes have
changed, just that they have been modified. The server can inform changed, just that they have been modified. The server can inform
the client about both critical and informational attribute changes in the client about both critical and informational attribute changes in
the bitmask arguments. The client SHOULD query the server about all the bitmask arguments. The client SHOULD query the server about all
attributes set in acca_critical. For all changes reflected in attributes set in acca_critical. For all changes reflected in
acca_info, the client can decide whether or not it wants to poll the acca_info, the client can decide whether or not it wants to poll the
server. server.
The CB_ATTR_CHANGED callback operation with the FATTR4_SEC_LABEL set The CB_ATTR_CHANGED callback operation with the FATTR4_SEC_LABEL set
in acca_critical is the method used by the server to indicate that in acca_critical is the method used by the server to indicate that
the MAC label for the file referenced by acca_fh has changed. In the MAC label for the file referenced by acca_fh has changed. In
many ways, the server does not care about the result returned by the many ways, the server does not care about the result returned by the
client. client.
12.2. Operation 15: CB_COPY - Report results of a server-side copy 13.2. Operation 15: CB_COPY - Report results of a server-side copy
13.2.1. ARGUMENT
12.2.1. ARGUMENT
union copy_info4 switch (nfsstat4 cca_status) { union copy_info4 switch (nfsstat4 cca_status) {
case NFS4_OK: case NFS4_OK:
void; void;
default: default:
length4 cca_bytes_copied; length4 cca_bytes_copied;
}; };
struct CB_COPY4args { struct CB_COPY4args {
nfs_fh4 cca_fh; nfs_fh4 cca_fh;
stateid4 cca_stateid; stateid4 cca_stateid;
copy_info4 cca_copy_info; copy_info4 cca_copy_info;
}; };
12.2.2. RESULT 13.2.2. RESULT
struct CB_COPY4res { struct CB_COPY4res {
nfsstat4 ccr_status; nfsstat4 ccr_status;
}; };
12.2.3. DESCRIPTION 13.2.3. DESCRIPTION
CB_COPY is used for both intra- and inter-server asynchronous copies. CB_COPY is used for both intra- and inter-server asynchronous copies.
The CB_COPY callback informs the client of the result of an The CB_COPY callback informs the client of the result of an
asynchronous server-side copy. This operation is sent by the asynchronous server-side copy. This operation is sent by the
destination server to the client in a CB_COMPOUND request. The copy destination server to the client in a CB_COMPOUND request. The copy
is identified by the filehandle and stateid arguments. The result is is identified by the filehandle and stateid arguments. The result is
indicated by the status field. If the copy failed, cca_bytes_copied indicated by the status field. If the copy failed, cca_bytes_copied
contains the number of bytes copied before the failure occurred. The contains the number of bytes copied before the failure occurred. The
cca_bytes_copied value indicates the number of bytes copied but not cca_bytes_copied value indicates the number of bytes copied but not
which specific bytes have been copied. which specific bytes have been copied.
skipping to change at page 91, line 41 skipping to change at page 88, line 8
If the client supports the COPY operation, the client is REQUIRED to If the client supports the COPY operation, the client is REQUIRED to
support the CB_COPY operation. support the CB_COPY operation.
The CB_COPY operation may fail for the following reasons (this is a The CB_COPY operation may fail for the following reasons (this is a
partial list): partial list):
NFS4ERR_NOTSUPP: The copy offload operation is not supported by the NFS4ERR_NOTSUPP: The copy offload operation is not supported by the
NFS client receiving this request. NFS client receiving this request.
13. IANA Considerations 14. IANA Considerations
This section uses terms that are defined in [27]. This section uses terms that are defined in [27].
14. References 15. References
14.1. Normative References 15.1. Normative References
[1] Bradner, S., "Key words for use in RFCs to Indicate Requirement [1] Bradner, S., "Key words for use in RFCs to Indicate Requirement
Levels", March 1997. Levels", March 1997.
[2] Shepler, S., Eisler, M., and D. Noveck, "Network File System [2] Shepler, S., Eisler, M., and D. Noveck, "Network File System
(NFS) Version 4 Minor Version 1 Protocol", RFC 5661, (NFS) Version 4 Minor Version 1 Protocol", RFC 5661,
January 2010. January 2010.
[3] Haynes, T., "Network File System (NFS) Version 4 Minor Version [3] Haynes, T., "Network File System (NFS) Version 4 Minor Version
2 External Data Representation Standard (XDR) Description", 2 External Data Representation Standard (XDR) Description",
skipping to change at page 92, line 38 skipping to change at page 89, line 5
[8] Halevy, B., Welch, B., and J. Zelenka, "Object-Based Parallel [8] Halevy, B., Welch, B., and J. Zelenka, "Object-Based Parallel
NFS (pNFS) Operations", RFC 5664, January 2010. NFS (pNFS) Operations", RFC 5664, January 2010.
[9] Shepler, S., Eisler, M., and D. Noveck, "Network File System [9] Shepler, S., Eisler, M., and D. Noveck, "Network File System
(NFS) Version 4 Minor Version 1 External Data Representation (NFS) Version 4 Minor Version 1 External Data Representation
Standard (XDR) Description", RFC 5662, January 2010. Standard (XDR) Description", RFC 5662, January 2010.
[10] Black, D., Glasgow, J., and S. Fridella, "Parallel NFS (pNFS) [10] Black, D., Glasgow, J., and S. Fridella, "Parallel NFS (pNFS)
Block/Volume Layout", RFC 5663, January 2010. Block/Volume Layout", RFC 5663, January 2010.
14.2. Informative References 15.2. Informative References
[11] Haynes, T. and D. Noveck, "Network File System (NFS) version 4 [11] Haynes, T. and D. Noveck, "Network File System (NFS) version 4
Protocol", draft-ietf-nfsv4-rfc3530bis-09 (Work In Progress), Protocol", draft-ietf-nfsv4-rfc3530bis-09 (Work In Progress),
March 2011. March 2011.
[12] Lentini, J., Everhart, C., Ellard, D., Tewari, R., and M. Naik, [12] Lentini, J., Everhart, C., Ellard, D., Tewari, R., and M. Naik,
"NSDB Protocol for Federated Filesystems", "NSDB Protocol for Federated Filesystems",
draft-ietf-nfsv4-federated-fs-protocol (Work In Progress), draft-ietf-nfsv4-federated-fs-protocol (Work In Progress),
2010. 2010.
 End of changes. 102 change blocks. 
617 lines changed or deleted 475 lines changed or added

This html diff was produced by rfcdiff 1.41. The latest version is available from http://tools.ietf.org/tools/rfcdiff/