NFSv4                                                          T. Haynes
Internet-Draft                                                    Editor
Intended status: Standards Track                          April 21, 2011
Expires: October 23, 2011

                     NFS Version 4 Minor Version 2
                  draft-ietf-nfsv4-minorversion2-01.txt
Abstract

   This Internet-Draft describes NFS version 4 minor version two,
   focusing mainly on the protocol extensions made from NFS version 4
   minor version 0 and NFS version 4 minor version 1.  Major extensions
   introduced in NFS version 4 minor version two include: Server-side
   Copy, Space Reservations, and Support for Sparse Files.
Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [1].
Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on October 23, 2011.
Copyright Notice

   Copyright (c) 2011 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.  Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.
   This document may contain material from IETF Documents or IETF
   Contributions published or made publicly available before November
   10, 2008.  The person(s) controlling the copyright in some of this
   material may not have granted the IETF Trust the right to allow
   modifications of such material outside the IETF Standards Process.
   Without obtaining an adequate license from the person(s) controlling
   the copyright in such materials, this document may not be modified
   outside the IETF Standards Process, and derivative works of it may
   not be created outside the IETF Standards Process, except to format
   it for publication as an RFC or to translate it into languages other
   than English.
Table of Contents

   1.  Introduction
     1.1.  The NFS Version 4 Minor Version 2 Protocol
     1.2.  Scope of This Document
     1.3.  NFSv4.2 Goals
     1.4.  Overview of NFSv4.2 Features
     1.5.  Differences from NFSv4.1
   2.  pNFS LAYOUTRETURN Error Handling
     2.1.  Introduction
     2.2.  Changes to Operation 51: LAYOUTRETURN
       2.2.1.  ARGUMENT
       2.2.2.  RESULT
       2.2.3.  DESCRIPTION
       2.2.4.  IMPLEMENTATION
   3.  Sharing change attribute implementation details with NFSv4
       clients
     3.1.  Abstract
     3.2.  Introduction
     3.3.  Definition of the 'change_attr_type' per-file system
           attribute
   4.  NFS Server-side Copy
     4.1.  Introduction
     4.2.  Protocol Overview
       4.2.1.  Intra-Server Copy
       4.2.2.  Inter-Server Copy
       4.2.3.  Server-to-Server Copy Protocol
     4.3.  Operations
       4.3.1.  netloc4 - Network Locations
       4.3.2.  Operation 61: COPY_NOTIFY - Notify a source server
               of a future copy
       4.3.3.  Operation 62: COPY_REVOKE - Revoke a destination
               server's copy privileges
       4.3.4.  Operation 59: COPY - Initiate a server-side copy
       4.3.5.  Operation 60: COPY_ABORT - Cancel a server-side copy
       4.3.6.  Operation 63: COPY_STATUS - Poll for status of a
               server-side copy
       4.3.7.  Operation 15: CB_COPY - Report results of a
               server-side copy
       4.3.8.  Copy Offload Stateids
     4.4.  Security Considerations
       4.4.1.  Inter-Server Copy Security
     4.5.  IANA Considerations
   5.  Space Reservation
     5.1.  Introduction
     5.2.  Use Cases
       5.2.1.  Space Reservation
       5.2.2.  Space freed on deletes
       5.2.3.  Operations and attributes
       5.2.4.  Attribute 77: space_reserve
       5.2.5.  Attribute 78: space_freed
       5.2.6.  Attribute 79: max_hole_punch
       5.2.7.  Operation 64: HOLE_PUNCH - Zero and deallocate
               blocks backing the file in the specified range.
     5.3.  Security Considerations
     5.4.  IANA Considerations
   6.  Sparse Files
     6.1.  Introduction
     6.2.  Terminology
     6.3.  Applications and Sparse Files
     6.4.  Overview of Sparse Files and NFSv4
     6.5.  Operation 65: READ_PLUS
       6.5.1.  ARGUMENT
       6.5.2.  RESULT
       6.5.3.  DESCRIPTION
       6.5.4.  IMPLEMENTATION
       6.5.5.  READ_PLUS with Sparse Files Example
     6.6.  Related Work
     6.7.  Other Proposed Designs
       6.7.1.  Multi-Data Server Hole Information
       6.7.2.  Data Result Array
       6.7.3.  User-Defined Sparse Mask
       6.7.4.  Allocated flag
       6.7.5.  Dense and Sparse pNFS File Layouts
   7.  Security Considerations
   8.  IANA Considerations
   9.  References
     9.1.  Normative References
     9.2.  Informative References
   Appendix A.  Acknowledgments
   Appendix B.  RFC Editor Notes
   Author's Address
1.  Introduction

1.1.  The NFS Version 4 Minor Version 2 Protocol

   The NFS version 4 minor version 2 (NFSv4.2) protocol is the third
   minor version of the NFS version 4 (NFSv4) protocol.  The first minor
   version, NFSv4.0, is described in [10] and the second minor version,
   NFSv4.1, is described in [2].  It follows the guidelines for minor
   versioning that are listed in Section 11 of RFC 3530bis.
1.2.  Scope of This Document
   This document describes the NFSv4.2 protocol.  With respect to
   NFSv4.0 and NFSv4.1, this document does not:

   o  describe the NFSv4.0 or NFSv4.1 protocols, except where needed to
      contrast with NFSv4.2.

   o  modify the specification of the NFSv4.0 or NFSv4.1 protocols.

   o  clarify the NFSv4.0 or NFSv4.1 protocols.

   The full XDR for NFSv4.2 is presented in [3].
1.3.  NFSv4.2 Goals

1.4.  Overview of NFSv4.2 Features

1.5.  Differences from NFSv4.1

2.  pNFS LAYOUTRETURN Error Handling

2.1.  Introduction
   In the pNFS description provided in [2], the client is not enabled to
   relay an error code from the DS to the MDS.  In the specification of
   the Objects-Based Layout protocol [4], use is made of the opaque
   lrf_body field of the LAYOUTRETURN argument to do such a relaying of
   error codes.  In this section, we define a new data structure to
   enable the passing of error codes back to the MDS and provide some
   guidelines on what both the client and MDS should expect in such
   circumstances.

   There are two broad classes of errors, transient and persistent.  The
   client SHOULD strive to only use this new mechanism to report
   persistent errors.  It MUST be able to deal with transient issues by
   itself.  Also, while the client might consider an issue to be
   persistent, it MUST be prepared for the MDS not to consider such
   issues to be persistent.  A prime example of this is if the MDS
   fences off a client from either a stateid or a filehandle.  The
   client will get an error from the DS and might relay either
   NFS4ERR_ACCESS or NFS4ERR_STALE_STATEID back to the MDS, with the
   belief that this is a hard error.  The MDS, on the other hand, is
   waiting for the client to report such an error.  For it, the mission
   is accomplished in that the client has returned a layout that the MDS
   had most likely recalled.
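   The draft leaves the transient-versus-persistent decision to
   implementations.  Purely as an illustration of one possible policy
   (the thresholds and helper name below are assumptions, not part of
   the protocol), a client heuristic might look like the following C
   sketch:

   /*
    * Illustrative client-side policy, not part of the protocol:
    * decide whether a DS error has persisted long enough to be
    * reported to the MDS via LAYOUTRETURN.  Thresholds are arbitrary.
    */
   #include <stdbool.h>

   #define NFS4ERR_NXIO           6      /* error values per RFC 5661 */
   #define NFS4ERR_ACCESS        13
   #define NFS4ERR_STALE_STATEID 10023

   static bool
   error_seems_persistent(int nfs_err, int retries, int elapsed_secs)
   {
           switch (nfs_err) {
           case NFS4ERR_NXIO:            /* no communication with DS */
                   return retries >= 3 || elapsed_secs > 30;
           case NFS4ERR_ACCESS:          /* may be MDS fencing; see */
           case NFS4ERR_STALE_STATEID:   /* the example above       */
                   return true;
           default:
                   return false;         /* treat as transient; retry */
           }
   }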
2.2.  Changes to Operation 51: LAYOUTRETURN

   The existing LAYOUTRETURN operation is extended by introducing a new
   data structure to report errors, layoutreturn_device_error4.  Also,
   layoutreturn_error_report4 is introduced to enable an array of such
   errors to be reported.
2.2.1.  ARGUMENT

   The ARGUMENT specification of the LAYOUTRETURN operation in section
   18.44.1 of [2] is augmented by the following XDR code [11]:
   struct layoutreturn_device_error4 {
           deviceid4       lrde_deviceid;
           nfsstat4        lrde_status;
           nfs_opnum4      lrde_opnum;
   };

   struct layoutreturn_error_report4 {
           layoutreturn_device_error4      lrer_errors<>;
   };
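   As a non-normative illustration, a client might accumulate these
   errors in a C structure mirroring the XDR above before encoding the
   report into the lrf_body field; the helper below and its capacity
   handling are sketch-level assumptions:

   /* C mirror of the XDR above (sketch; XDR encoding not shown). */
   #include <stdint.h>

   typedef uint32_t nfsstat4;
   typedef uint32_t nfs_opnum4;
   typedef struct { uint8_t data[16]; } deviceid4;   /* 16 opaque
                                               bytes per RFC 5661 */

   struct layoutreturn_device_error4 {
           deviceid4       lrde_deviceid;
           nfsstat4        lrde_status;
           nfs_opnum4      lrde_opnum;
   };

   struct layoutreturn_error_report4 {
           uint32_t                           lrer_count;
           struct layoutreturn_device_error4 *lrer_errors; /* <> array */
   };

   /*
    * Record one DS error for a later LAYOUTRETURN; the caller is
    * assumed to have sized lrer_errors appropriately.
    */
   static void
   record_ds_error(struct layoutreturn_error_report4 *rep,
                   deviceid4 dev, nfs_opnum4 op, nfsstat4 status)
   {
           struct layoutreturn_device_error4 *e =
                   &rep->lrer_errors[rep->lrer_count++];

           e->lrde_deviceid = dev;
           e->lrde_opnum    = op;       /* e.g., OP_WRITE */
           e->lrde_status   = status;   /* e.g., NFS4ERR_NXIO */
   }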
2.2.2.  RESULT

   The RESULT of the LAYOUTRETURN operation is unchanged; see section
   18.44.2 of [2].

2.2.3.  DESCRIPTION

   The following text is added to the end of the LAYOUTRETURN operation
   DESCRIPTION in section 18.44.3 of [2].
   When a client uses LAYOUTRETURN with a type of LAYOUTRETURN4_FILE,
   then if the lrf_body field is NULL, it indicates to the MDS that the
   client experienced no errors.  If lrf_body is non-NULL, then the
   field references error information which is layout type specific.
   I.e., the Objects-Based Layout protocol can continue to utilize
   lrf_body as specified in [4].  For the Files-Based Layout, the field
   references a layoutreturn_error_report4, which contains an array of
   layoutreturn_device_error4 entries.

   Each individual layoutreturn_device_error4 describes a single error
   associated with a DS, which is identified via lrde_deviceid.  The
   operation which returned the error is identified via lrde_opnum.
   Finally, the NFS error value (nfsstat4) encountered is provided via
   lrde_status and may consist of the following error codes:

   NFS4_OK:  No issues were found for this device.

   NFS4ERR_NXIO:  The client was unable to establish any communication
      with the DS.

   NFS4ERR_*:  The client was able to establish communication with the
      DS and is returning one of the allowed error codes for the
      operation denoted by lrde_opnum.
2.2.4.  IMPLEMENTATION

   The following text is added to the end of the LAYOUTRETURN operation
   IMPLEMENTATION in section 18.44.4 of [2].
   A client that expects to use pNFS for a mounted filesystem SHOULD
   check for pNFS support at mount time.  This check SHOULD be performed
   by sending a GETDEVICELIST operation, followed by layout-type-
   specific checks for accessibility of each storage device returned by
   GETDEVICELIST.  If the NFS server does not support pNFS, the
   GETDEVICELIST operation will be rejected with an NFS4ERR_NOTSUPP
   error; in this situation it is up to the client to determine whether
   it is acceptable to proceed with NFS-only access.
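   To make the mount-time sequence concrete, here is a hedged C sketch;
   nfs4_getdevicelist(), layout_check_device(), and the surrounding
   types are hypothetical stand-ins for a client's own RPC plumbing:

   /* Hypothetical client-side mount-time pNFS probe (non-normative). */
   #include <stddef.h>

   #define NFS4_OK          0
   #define NFS4ERR_NOTSUPP  10004          /* value per RFC 5661 */

   struct device { unsigned char id[16]; };
   struct devlist { size_t count; struct device *devs; };
   struct mount;

   /* Stubs assumed to exist in the client implementation. */
   int  nfs4_getdevicelist(struct mount *mnt, struct devlist *out);
   int  layout_check_device(struct mount *mnt, struct device *dev);
   int  decide_nfs_only_access(struct mount *mnt);
   void note_inaccessible_device(struct mount *mnt, struct device *dev);

   int
   pnfs_probe_at_mount(struct mount *mnt)
   {
           struct devlist dl;
           int status = nfs4_getdevicelist(mnt, &dl);

           if (status == NFS4ERR_NOTSUPP)   /* server lacks pNFS */
                   return decide_nfs_only_access(mnt);
           if (status != NFS4_OK)
                   return status;

           /* Layout-type-specific accessibility check per device. */
           for (size_t i = 0; i < dl.count; i++)
                   if (layout_check_device(mnt, &dl.devs[i]) != NFS4_OK)
                           note_inaccessible_device(mnt, &dl.devs[i]);

           return NFS4_OK;
   }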
   Clients are expected to tolerate transient storage device errors, and
   hence clients SHOULD NOT use the LAYOUTRETURN error handling for
   device access problems that may be transient.  The methods by which a
   client decides whether an access problem is transient vs. persistent
   are implementation-specific, but may include retrying I/Os to a data
   server under appropriate conditions.
   When an I/O fails to a storage device, the client SHOULD retry the
   failed I/O via the MDS.  In this situation, before retrying the I/O,
   the client SHOULD return the layout, or the affected portion thereof,
   and SHOULD indicate which storage device or devices were problematic.
   If the client does not do this, the MDS may issue a layout recall
   callback in order to perform the retried I/O.
   The client needs to be cognizant that since this error handling is
   optional in the MDS, the MDS may silently ignore this functionality.
   Also, as the MDS may consider some issues the client reports to be
   expected (see Section 2.1), the client might find it difficult to
   detect an MDS which has not implemented error handling via
   LAYOUTRETURN.
   If an MDS is aware that a storage device is proving problematic to a
   client, the MDS SHOULD NOT include that storage device in any pNFS
   layouts sent to that client.  If the MDS is aware that a storage
   device is affecting many clients, then the MDS SHOULD NOT include
   that storage device in any pNFS layouts sent out.  Clients must still
   be aware that the MDS might not have any choice in using the storage
   device, i.e., there might only be one possible layout for the system.
   Another interesting complication is that for existing files, the MDS
   might have no choice in which storage devices to hand out to clients.
   The MDS might try to restripe a file across a different storage
   device, but clients need to be aware that not all implementations
   have restriping support.
   An MDS SHOULD react to a client return of layouts with errors by not
   using the problematic storage devices in layouts for that client, but
   the MDS is not required to indefinitely retain per-client storage
   device error information.  An MDS is also not required to
   automatically reinstate use of a previously problematic storage
   device; administrative intervention may be required instead.
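   A minimal sketch of the per-client bookkeeping this implies for an
   MDS, under the assumption that reported device errors are tracked
   only in memory (nothing below is mandated by the draft):

   /*
    * Non-normative sketch: per-client table of devices reported bad.
    * An MDS consults this when building layouts; entries may be aged
    * out, since indefinite retention is not required.
    */
   #include <stdbool.h>
   #include <string.h>
   #include <time.h>

   #define MAX_BAD   64
   #define DEVID_SZ  16    /* deviceid4 is 16 opaque bytes, RFC 5661 */

   struct bad_device {
           unsigned char devid[DEVID_SZ];
           time_t        reported_at;
   };

   struct client_state {
           struct bad_device bad[MAX_BAD];
           int               nbad;
   };

   static bool
   device_reported_bad(const struct client_state *cs,
                       const unsigned char devid[DEVID_SZ])
   {
           for (int i = 0; i < cs->nbad; i++)
                   if (memcmp(cs->bad[i].devid, devid, DEVID_SZ) == 0)
                           return true;
           return false;
   }

   /* Called when LAYOUTRETURN carries a layoutreturn_device_error4. */
   static void
   note_device_error(struct client_state *cs,
                     const unsigned char devid[DEVID_SZ])
   {
           if (cs->nbad < MAX_BAD && !device_reported_bad(cs, devid)) {
                   memcpy(cs->bad[cs->nbad].devid, devid, DEVID_SZ);
                   cs->bad[cs->nbad].reported_at = time(NULL);
                   cs->nbad++;
           }
   }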
   A client MAY perform I/O via the MDS even when the client holds a
   layout that covers the I/O; servers MUST support this client
   behavior, and MAY recall layouts as needed to complete I/Os.
3.  Sharing change attribute implementation details with NFSv4 clients

3.1.  Abstract

   This document describes an extension to the NFSv4 protocol that
   allows the server to share information about the implementation of
   its change attribute with the client.  The aim is to improve the
   client's ability to determine the order in which parallel updates to
   the same file were processed.
3.2.  Introduction

   [...] third GETATTR that is fully serialised with the first two.
   In order to avoid this kind of inefficiency, we propose a method to
   allow the server to share details about how the change attribute is
   expected to evolve, so that the client may immediately determine
   which, out of the several change attribute values returned by the
   server, is the most recent.

3.3.  Definition of the 'change_attr_type' per-file system attribute
   enum change_attr_typeinfo {
           NFS4_CHANGE_TYPE_IS_MONOTONIC_INCR         = 0,
           NFS4_CHANGE_TYPE_IS_VERSION_COUNTER        = 1,
           NFS4_CHANGE_TYPE_IS_VERSION_COUNTER_NOPNFS = 2,
           NFS4_CHANGE_TYPE_IS_TIME_METADATA          = 3,
           NFS4_CHANGE_TYPE_IS_UNDEFINED              = 4
   };
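   For illustration, once a client knows the server's change_attr_type
   it can order change attribute values directly in all but the
   undefined case; a hedged C sketch (the helper is illustrative, not
   part of the protocol):

   /* C mirror of the XDR enum above, plus a sketch comparator. */
   #include <stdbool.h>
   #include <stdint.h>

   enum change_attr_typeinfo {
           NFS4_CHANGE_TYPE_IS_MONOTONIC_INCR         = 0,
           NFS4_CHANGE_TYPE_IS_VERSION_COUNTER        = 1,
           NFS4_CHANGE_TYPE_IS_VERSION_COUNTER_NOPNFS = 2,
           NFS4_CHANGE_TYPE_IS_TIME_METADATA          = 3,
           NFS4_CHANGE_TYPE_IS_UNDEFINED              = 4
   };

   /*
    * Returns true if 'candidate' is known to be more recent than
    * 'cached'.  With an undefined change_attr_type the client cannot
    * order the values and must fall back to serialised GETATTRs.
    */
   static bool
   change_attr_is_newer(enum change_attr_typeinfo type,
                        uint64_t cached, uint64_t candidate)
   {
           switch (type) {
           case NFS4_CHANGE_TYPE_IS_MONOTONIC_INCR:
           case NFS4_CHANGE_TYPE_IS_VERSION_COUNTER:
           case NFS4_CHANGE_TYPE_IS_VERSION_COUNTER_NOPNFS:
           case NFS4_CHANGE_TYPE_IS_TIME_METADATA:
                   /* These are assumed to evolve in increasing order. */
                   return candidate > cached;
           case NFS4_CHANGE_TYPE_IS_UNDEFINED:
           default:
                   return false;   /* cannot tell; re-query the server */
           }
   }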
   +------------------+----+---------------------------+-----+
   | Name             | Id | Data Type                 | Acc |
   +------------------+----+---------------------------+-----+

   [...]
4.2.1.  Intra-Server Copy

   To copy a file on a single server, the client uses a COPY operation.
   The server may respond to the copy operation with the final results
   of the copy or it may perform the copy asynchronously and deliver the
   results using a CB_COPY operation callback.  If the copy is performed
   asynchronously, the client may poll the status of the copy using
   COPY_STATUS or cancel the copy using COPY_ABORT.

   A synchronous intra-server copy is shown in Figure 1.  In this
   example, the NFS server chooses to perform the copy synchronously.
   The copy operation is completed, either successfully or
   unsuccessfully, before the server replies to the client's request.
   The server's reply contains the final result of the operation.
     Client                                  Server
        +                                       +
        |                                       |
        |--- COPY ----------------------------->| Client requests
        |<-------------------------------------/| a file copy
        |                                       |
        |                                       |

             Figure 1: A synchronous intra-server copy.
   An asynchronous intra-server copy is shown in Figure 2.  In this
   example, the NFS server performs the copy asynchronously.  The
   server's reply to the copy request indicates that the copy operation
   was initiated and the final result will be delivered at a later time.
   The server's reply also contains a copy stateid.  The client may use
   this copy stateid to poll for status information (as shown) or to
   cancel the copy using a COPY_ABORT.  When the server completes the
   copy, the server performs a callback to the client and reports the
   results.

     Client                                  Server
        +                                       +
        |                                       |
        |--- COPY ----------------------------->| Client requests
        |<-------------------------------------/| a file copy
        |                                       |
        |--- COPY_STATUS ----------------------->| Client may poll
        |<-------------------------------------/| for status
        |                                       |
        |   .                                   | Multiple COPY_STATUS
        |   .                                   | operations may be sent.
        |   .                                   |
        |                                       |
        |<-- CB_COPY ---------------------------| Server reports results
        |\-------------------------------------->|
        |                                       |

            Figure 2: An asynchronous intra-server copy.
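   A hedged sketch of the client-side logic this figure implies;
   nfs4_copy(), nfs4_copy_status(), and cb_copy_received() are
   hypothetical stubs for the client's RPC and callback machinery:

   /* Non-normative sketch of a client driving an asynchronous copy. */
   #include <stdbool.h>
   #include <unistd.h>

   struct stateid;                       /* opaque copy stateid */
   struct copy_result { bool async; struct stateid *sid; };

   /* Hypothetical stubs. */
   struct copy_result nfs4_copy(const char *src, const char *dst);
   int  nfs4_copy_status(struct stateid *sid, unsigned long *copied);
   bool cb_copy_received(struct stateid *sid);

   int
   do_copy(const char *src, const char *dst)
   {
           struct copy_result r = nfs4_copy(src, dst);

           if (!r.async)
                   return 0;     /* synchronous reply: copy is done */

           /* Poll with COPY_STATUS until CB_COPY reports results; a
            * real client might instead issue COPY_ABORT to cancel. */
           unsigned long copied = 0;
           while (!cb_copy_received(r.sid)) {
                   nfs4_copy_status(r.sid, &copied);
                   sleep(1);     /* back off between polls */
           }
           return 0;
   }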
4.2.2.  Inter-Server Copy

   A copy may also be performed between two servers.  The copy protocol
   is designed to accommodate a variety of network topologies.  As shown
   in Figure 3, the client and servers may be connected by multiple
   networks.  In particular, the servers may be connected by a
   specialized, high speed network (network 192.168.33.0/24 in the
   diagram) that does not include the client.  The protocol allows the
   client to setup the copy between the servers (over network
   10.11.78.0/24 in the diagram) and for the servers to communicate on
   the high speed network if they choose to do so.
                         192.168.33.0/24
       +-------------------------------------+
       |                                     |

                        [...]

       |                                     |
       |            10.11.78.0/24            |
       +------------------+------------------+
                          |
                          |
                          | 10.11.78.243
                    +-----+-----+
                    |  Client   |
                    +-----------+

        Figure 3: An example inter-server network topology.
   For an inter-server copy, the client notifies the source server that
   a file will be copied by the destination server using a COPY_NOTIFY
   operation.  The client then initiates the copy by sending the COPY
   operation to the destination server.  The destination server may
   perform the copy synchronously or asynchronously.

   A synchronous inter-server copy is shown in Figure 4.  In this case,
   the destination server chooses to perform the copy before responding
   to the client's COPY request.

   An asynchronous copy is shown in Figure 5.  In this case, the
   destination server chooses to respond to the client's COPY request
   immediately and then perform the copy asynchronously.
     Client                Source               Destination
        +                    +                       +
        |                    |                       |
        |--- COPY_NOTIFY --->|                       |
        |<------------------/|                       |
        |                    |                       |
        |                    |                       |

                            [...]

        |                    |\--------------------->|
        |                    |                       |
        |                    |   .                   | Multiple reads may
        |                    |   .                   | be necessary
        |                    |   .                   |
        |                    |                       |
        |                    |                       |
        |<------------------------------------------/| Destination replies
        |                    |                       | to COPY

              Figure 4: A synchronous inter-server copy.
     Client                Source               Destination
        +                    +                       +
        |                    |                       |
        |--- COPY_NOTIFY --->|                       |
        |<------------------/|                       |
        |                    |                       |
        |                    |                       |
        |--- COPY ---------------------------------->|
        |<------------------------------------------/|

                            [...]

        |                    |   .                   | Multiple COPY_STATUS
        |                    |   .                   | operations may be sent
        |                    |   .                   |
        |                    |                       |
        |                    |                       |
        |                    |                       |
        |<-- CB_COPY --------------------------------| Destination reports
        |\------------------------------------------>| results
        |                    |                       |

             Figure 5: An asynchronous inter-server copy.
4.2.3.  Server-to-Server Copy Protocol

   During an inter-server copy, the destination server reads the file
   data from the source server.  The source server and destination
   server are not required to use a specific protocol to transfer the
   file data.  The choice of what protocol to use is ultimately the
   destination server's decision.

4.2.3.1.  Using NFSv4.x as a Server-to-Server Copy Protocol
   [...]
   If the netloc4 is of type NL4_NAME, the nl_name field MUST be
   specified as a UTF-8 string.  The nl_name is expected to be resolved
   to a network address via DNS, LDAP, NIS, /etc/hosts, or some other
   means.  If the netloc4 is of type NL4_URL, a server URL [5]
   appropriate for the server-to-server copy operation is specified as a
   UTF-8 string.  If the netloc4 is of type NL4_NETADDR, the nl_addr
   field MUST contain a valid netaddr4 as defined in Section 3.3.9 of
   [2].

   When netloc4 values are used for an inter-server copy as shown in
   Figure 3, their values may be evaluated on the source server,
   destination server, and client.  The network environment in which
   these systems operate should be configured so that the netloc4 values
   are interpreted as intended on each system.
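   The netloc4 XDR definition itself is not part of this excerpt.
   Under the assumption that it is a discriminated union over the three
   types named above (the discriminant values below are assumptions), a
   client-side C mirror might look like this, for illustration only:

   /* Illustrative C mirror of the netloc4 discriminated union. */
   enum netloc_type4 {
           NL4_NAME    = 0,    /* discriminant values are assumptions */
           NL4_URL     = 1,
           NL4_NETADDR = 2
   };

   struct netaddr4 {               /* per Section 3.3.9 of RFC 5661 */
           char *na_r_netid;       /* network id, e.g. "tcp" */
           char *na_r_addr;        /* universal address */
   };

   struct netloc4 {
           enum netloc_type4 nl_type;
           union {
                   char           *nl_name; /* NL4_NAME: UTF-8 hostname */
                   char           *nl_url;  /* NL4_URL: UTF-8 server URL */
                   struct netaddr4 nl_addr; /* NL4_NETADDR */
           } u;
   };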
4.3.2.  Operation 61: COPY_NOTIFY - Notify a source server of a future
        copy

4.3.2.1.  ARGUMENT

   struct COPY_NOTIFY4args {
           /* CURRENT_FH: source file */
           netloc4         cna_destination_server;
   };
4.3.2.2.  RESULT

   struct COPY_NOTIFY4resok {
           nfstime4        cnr_lease_time;
           netloc4         cnr_source_server<>;
   };

   union COPY_NOTIFY4res switch (nfsstat4 cnr_status) {
   case NFS4_OK:
           COPY_NOTIFY4resok       resok4;
   default:
           void;
   };
4.3.2.3.  DESCRIPTION

   This operation is used for an inter-server copy.  A client sends this
   operation in a COMPOUND request to the source server to authorize a
   destination server identified by cna_destination_server to read the
   file specified by CURRENT_FH on behalf of the given user.
   [...]
           offset4         ca_dst_offset;
           length4         ca_count;
           uint32_t        ca_flags;
           component4      ca_destination;
           netloc4         ca_source_server<>;
   };
4.3.4.2.  RESULT

   union COPY4res switch (nfsstat4 cr_status) {
   case NFS4_OK:
           stateid4        cr_callback_id<1>;
   default:
           length4         cr_bytes_copied;
   };
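   Reading COPY4res in a client: per the overview in Section 4.2.1, an
   asynchronous reply carries a copy stateid, so a plausible
   (non-normative, assumed) decoding is that a one-element
   cr_callback_id marks an asynchronous copy and an empty array a
   synchronous one; sketched in C:

   #include <stdbool.h>
   #include <stdint.h>

   typedef uint32_t nfsstat4;
   #define NFS4_OK 0

   struct stateid4 { uint32_t seqid; uint8_t other[12]; };

   struct COPY4res {
           nfsstat4        cr_status;
           uint32_t        cr_callback_len;  /* 0 or 1 (XDR <1> array) */
           struct stateid4 cr_callback_id[1];
           uint64_t        cr_bytes_copied;  /* valid on failure */
   };

   /*
    * Returns true and copies out the stateid when the server chose to
    * run the copy asynchronously (interpretation assumed; see above).
    */
   static bool
   copy_started_async(const struct COPY4res *res, struct stateid4 *sid)
   {
           if (res->cr_status != NFS4_OK || res->cr_callback_len == 0)
                   return false;
           *sid = res->cr_callback_id[0];
           return true;
   }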
4.3.4.3.  DESCRIPTION

   The COPY operation is used for both intra- and inter-server copies.
   In both cases, the COPY is always sent from the client to the

   [...]
4.3.6.1.  ARGUMENT

   struct COPY_STATUS4args {
           /* CURRENT_FH: destination file */
           stateid4        csa_stateid;
   };
4.3.6.2.  RESULT

   struct COPY_STATUS4resok {
           length4         csr_bytes_copied;
           nfsstat4        csr_complete<1>;
   };

   union COPY_STATUS4res switch (nfsstat4 csr_status) {
   case NFS4_OK:
           COPY_STATUS4resok       resok4;
   default:
           void;
   };
4.3.6.3.  DESCRIPTION

   COPY_STATUS is used for both intra- and inter-server asynchronous
   copies.  The COPY_STATUS operation allows the client to poll the
   server to determine the status of an asynchronous copy operation.
   This operation is sent by the client to the destination server.

   [...]
   When the client sends the source server the COPY_NOTIFY operation,
   the source server may reply to the client with a list of target
   addresses, names, and/or URLs and assign them to the unique triple:
   <source fh, user ID, destination address Y>.  If the destination uses
   one of these target netlocs to contact the source server, the source
   server will be able to uniquely identify the destination server, even
   if the destination server does not connect from the address specified
   by the client in COPY_NOTIFY.

   For example, suppose the network topology is as shown in Figure 3.
   If the source filehandle is 0x12345, the source server may respond to
   a COPY_NOTIFY for destination 10.11.78.56 with the URLs:

      nfs://10.11.78.18//_COPY/10.11.78.56/_FH/0x12345

      nfs://192.168.33.18//_COPY/10.11.78.56/_FH/0x12345

   The client will then send these URLs to the destination server in the
   COPY operation.  Suppose that the 192.168.33.0/24 network is a high
   speed network and the destination server decides to transfer the file
skipping to change at page 51, line 19 skipping to change at page 47, line 38
5.2.7.1. ARGUMENT 5.2.7.1. ARGUMENT
struct HOLE_PUNCH4args { struct HOLE_PUNCH4args {
/* CURRENT_FH: file */ /* CURRENT_FH: file */
offset4 hpa_offset; offset4 hpa_offset;
length4 hpa_count; length4 hpa_count;
}; };
5.2.7.2. RESULT 5.2.7.2. RESULT
struct HOLEPUNCH4res { struct HOLE_PUNCH4res {
nfsstat4 hpr_status; nfsstat4 hpr_status;
}; };
5.2.7.3. DESCRIPTION 5.2.7.3. DESCRIPTION
Whenever a client wishes to deallocate the blocks backing a Whenever a client wishes to deallocate the blocks backing a
particular region in the file, it calls the HOLE_PUNCH operation with particular region in the file, it calls the HOLE_PUNCH operation with
the current filehandle set to the filehandle of the file in question, the current filehandle set to the filehandle of the file in question,
start offset and length in bytes of the region set in hpa_offset and start offset and length in bytes of the region set in hpa_offset and
hpa_count, respectively. All further reads to this region MUST return hpa_count, respectively. All further reads to this region MUST return
skipping to change at page 52, line 34 skipping to change at page 49, line 13
ordinary file. ordinary file.
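The post-condition described above (once a region is punched, reads of
it return zeros) can be pictured with a small in-memory model; this is
a sketch, not server code:

   class SparseFile:
       # Toy model: only written extents are stored; every other byte,
       # including punched ranges, reads back as zero.
       def __init__(self, size):
           self.size = size
           self.extents = {}   # offset -> bytes

       def write(self, offset, data):
           self.extents[offset] = bytes(data)

       def hole_punch(self, hpa_offset, hpa_count):
           # Deallocate [hpa_offset, hpa_offset + hpa_count), keeping
           # the parts of overlapping extents that fall outside it.
           end = hpa_offset + hpa_count
           for off in list(self.extents):
               data = self.extents[off]
               if off + len(data) <= hpa_offset or off >= end:
                   continue                    # untouched by the punch
               del self.extents[off]
               if off < hpa_offset:            # keep leading piece
                   self.extents[off] = data[:hpa_offset - off]
               if off + len(data) > end:       # keep trailing piece
                   self.extents[end] = data[end - off:]

       def read(self, offset, count):
           out = bytearray(count)              # holes read as zeros
           for off, data in self.extents.items():
               lo = max(off, offset)
               hi = min(off + len(data), offset + count)
               if lo < hi:
                   out[lo - offset:hi - offset] = data[lo - off:hi - off]
           return bytes(out)

   f = SparseFile(1 << 20)
   f.write(0, b"x" * 8192)
   f.hole_punch(4096, 4096)                    # hpa_offset, hpa_count
   assert f.read(4096, 4096) == b"\0" * 4096   # punched region: zeros
   assert f.read(0, 4096) == b"x" * 4096       # data before it intact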
5.3. Security Considerations 5.3. Security Considerations
There are no security considerations for this section. There are no security considerations for this section.
5.4. IANA Considerations 5.4. IANA Considerations
This section has no actions for IANA. This section has no actions for IANA.
6. Simple and Efficient Read Support for Sparse Files 6. Sparse Files
6.1. Introduction 6.1. Introduction
NFS is now used in many data centers as the sole or primary method of
data access. Consequently, more types of applications are using NFS
than ever before, each with their own requirements and generated
workloads. As part of this, sparse files are increasing in number
while NFS continues to lack any specific knowledge of a sparse file's
layout. This document puts forth a proposal for the NFSv4.2 protocol
to support efficient reading of sparse files.
A sparse file is a common way of representing a large file without A sparse file is a common way of representing a large file without
having to reserve disk space for it. Consequently, a sparse file having to utilize all of the disk space for it. Consequently, a
uses less physical space than its size indicates. This means the sparse file uses less physical space than its size indicates. This
file contains 'holes', byte ranges within the file that contain no means the file contains 'holes', byte ranges within the file that
data. Most modern file systems support sparse files, including most contain no data. Most modern file systems support sparse files,
UNIX file systems and NTFS, but notably not Apple's HFS+. Common including most UNIX file systems and NTFS, but notably not Apple's
examples of sparse files include VM OS/disk images, database files, HFS+. Common examples of sparse files include Virtual Machine (VM)
log files, and even checkpoint recovery files most commonly used by OS/disk images, database files, log files, and even checkpoint
the HPC community. recovery files most commonly used by the HPC community.
If an application reads a hole in a sparse file, the file system must If an application reads a hole in a sparse file, the file system must
return all zeros to the application. For local data access there is return all zeros to the application. For local data access there is
little penalty, but with NFS these zeroes must be transferred back to little penalty, but with NFS these zeroes must be transferred back to
the client. If an application uses the NFS client to read data into the client. If an application uses the NFS client to read data into
memory, this wastes time and bandwidth as the application waits for memory, this wastes time and bandwidth as the application waits for
the zeroes to be transferred. Once the zeroes arrive, they then the zeroes to be transferred.
steal memory or cache space from real data. To make matters worse,
if an application then proceeds to write data to another file system,
the zeros are written into the file, expanding the sparse file into a
full sized regular file. Beyond wasting disk space, this can
actually prevent large sparse files from ever being copied to another
storage location due to space limitations.
This document adds a new READPLUS operation to efficiently read from A sparse file is typically created by initializing the file to be all
sparse files by avoiding the transfer of all zero regions from the zeros - nothing is written to the data in the file, instead the hole
server to the client. READPLUS supports all the features of READ but is recorded in the metadata for the file. So a 8G disk image might
includes a minimal extension to support sparse files. In addition, be represented initially by a couple hundred bits in the inode and
the return value of READPLUS is now compatible with NFSv4.1 minor nothing on the disk. If the VM then writes 100M of data in the
versioning rules and could support other future extensions without middle of the image, there would now be two holes represented in the
requiring yet another operation. READPLUS is guaranteed to perform metadata and 100M in the data.
no worse than READ, and can dramatically improve performance with
sparse files. READPLUS does not depend on pNFS protocol features, Other applications want to initialize a file to patterns other than
but can be used by pNFS to support sparse files. zero. The problem with initializing to zero is that it is often
difficult to distinguish a byte-range initialized to all zeroes
from data corruption, since a pattern of zeroes is a probable pattern
for corruption. Instead, some applications, such as database
management systems, use pattern consisting of bytes or words of non-
zero values.
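To make the earlier disk-image example concrete, here is a minimal
sketch of creating a sparse file (the write is scaled down from 100M so
the demo runs quickly; the block accounting requires a file system with
sparse-file support):

   import os

   fd = os.open("image.img", os.O_CREAT | os.O_RDWR | os.O_TRUNC, 0o644)
   os.ftruncate(fd, 8 << 30)        # 8 GiB apparent size, no data blocks

   # Write 1 MiB into the middle of the image; blocks are allocated
   # only for this range, leaving holes on either side.
   os.pwrite(fd, b"\xab" * (1 << 20), 4 << 30)

   st = os.fstat(fd)
   print("apparent size:", st.st_size)        # 8589934592
   print("allocated:", st.st_blocks * 512)    # roughly 1 MiB
   os.close(fd)
   os.unlink("image.img")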
Besides reading sparse files and initializing them, applications
might want to hole punch, which is the deallocation of the data
blocks which back a region of the file. At such time, the affected
blocks are reinitialized to a pattern.
This section introduces a new operation to read patterns from a file,
READ_PLUS, and a new operation to both initialize patterns and to
punch pattern holes into a file, WRITE_PLUS. READ_PLUS supports all
the features of READ but includes an extension to support sparse
pattern files. In addition, the return value of READ_PLUS is now
compatible with NFSv4.1 minor versioning rules and could support
other future extensions without requiring yet another operation.
READ_PLUS is guaranteed to perform no worse than READ, and can
dramatically improve performance with sparse files. READ_PLUS does
not depend on pNFS protocol features, but can be used by pNFS to
support sparse files.
6.2. Terminology 6.2. Terminology
Regular file Regular file: An object of file type NF4REG or Regular file: An object of file type NF4REG or NF4NAMEDATTR.
NF4NAMEDATTR.
Sparse file Sparse File. A Regular file that contains one or more Sparse file: A Regular file that contains one or more Holes.
Holes.
Hole Hole. A byte range within a Sparse file that contains regions Hole: A byte range within a Sparse file that contains regions of all
of all zeroes. For block-based file systems, this could also be zeroes. For block-based file systems, this could also be an
an unallocated region of the file. unallocated region of the file.
Hole Threshold The minimum length of a Hole as determined by the
server. If a server chooses to define a Hole Threshold, then it
would not return hole information (nfs_readplusreshole) with a
hole_offset and hole_length that specify a range shorter than the
Hole Threshold.
6.3. Applications and Sparse Files 6.3. Applications and Sparse Files
Applications may cause an NFS client to read holes in a file for Applications may cause an NFS client to read holes in a file for
several reasons. This section describes three different application several reasons. This section describes three different application
workloads that cause the NFS client to transfer data unnecessarily. workloads that cause the NFS client to transfer data unnecessarily.
These workloads are simply examples, and there are probably many more These workloads are simply examples, and there are probably many more
workloads that are negatively impacted by sparse files. workloads that are negatively impacted by sparse files.
The first workload that can cause holes to be read is sequential The first workload that can cause holes to be read is sequential
skipping to change at page 54, line 42 skipping to change at page 51, line 32
The third workload is generated by applications that do not utilize The third workload is generated by applications that do not utilize
the NFS client cache, but instead use direct I/O and manage cached the NFS client cache, but instead use direct I/O and manage cached
data independently, e.g., databases. These applications may perform data independently, e.g., databases. These applications may perform
whole file caching with sparse files, which would mean that even the whole file caching with sparse files, which would mean that even the
holes will be transferred to the clients and cached. holes will be transferred to the clients and cached.
6.4. Overview of Sparse Files and NFSv4 6.4. Overview of Sparse Files and NFSv4
This proposal seeks to provide sparse file support to the largest This proposal seeks to provide sparse file support to the largest
number of NFS client and server implementations, and as such proposes number of NFS client and server implementations, and as such proposes
to add a new return code to the mandatory NFSv4.1 READPLUS operation to add a new return code to the mandatory NFSv4.1 READ_PLUS operation
instead of proposing additions or extensions of new or existing instead of proposing additions or extensions of new or existing
optional features (such as pNFS). optional features (such as pNFS).
As well, this document seeks to ensure that the proposed extensions As well, this document seeks to ensure that the proposed extensions
are simple and do not transfer data between the client and server are simple and do not transfer data between the client and server
unnecessarily. For example, one possible way to implement sparse unnecessarily. For example, one possible way to implement sparse
file read support would be to have the client, on the first hole file read support would be to have the client, on the first hole
encountered or at OPEN time, request a Data Region Map from the encountered or at OPEN time, request a Data Region Map from the
server. A Data Region Map would specify all zero and non-zero server. A Data Region Map would specify all zero and non-zero
regions in a file. While this option seems simple, it is less useful regions in a file. While this option seems simple, it is less useful
and can become inefficient and cumbersome for several reasons: and can become inefficient and cumbersome for several reasons:
o Data Region Maps can be large, and transferring them can reduce o Data Region Maps can be large, and transferring them can reduce
overall read performance. For example, VMware's .vmdk files can overall read performance. For example, VMware's .vmdk files can
have a file size of over 100 GBs and have a map well over several have a file size of over 100 GBs and have a map well over several
MBs (a sizing sketch follows this list). MBs (a sizing sketch follows this list).
o Data Region Maps can change frequently, and become invalidated on o Data Region Maps can change frequently, and become invalidated on
every write to the file. This can result the map being every write to the file. NFSv4 has a single change attribute,
transferred multiple times with each update to the file. For which means any change to any region of a file will invalidate all
example, a VM that updates a config file in its file system image Data Region Maps. This can result in the map being transferred
would invalidate the Data Region Map not only for itself, but for multiple times with each update to the file. For example, a VM
all other clients accessing the same file system image. that updates a config file in its file system image would
invalidate the Data Region Map not only for itself, but for all
other clients accessing the same file system image.
o Data Region Maps do not handle all zero-filled sections of the o Data Region Maps do not handle all zero-filled sections of the
file, reducing the effectiveness of the solution. While it may be file, reducing the effectiveness of the solution. While it may be
possible to modify the maps to handle zero-filled sections (at possible to modify the maps to handle zero-filled sections (at
possibly great effort to the server), it is almost impossible with possibly great effort to the server), it is almost impossible with
pNFS. With pNFS, the owner of the Data Region Map is the metadata pNFS. With pNFS, the owner of the Data Region Map is the metadata
server, which is not in the data path and has no knowledge of the server, which is not in the data path and has no knowledge of the
contents of a data region. contents of a data region.
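To make the sizing claim in the first bullet concrete, a back-of-the-
envelope calculation; the 4 KiB granularity and one-bit-per-block
encoding are assumptions of this sketch, not the draft's:

   file_size = 100 * 2**30                  # a 100 GiB .vmdk
   block = 4 * 2**10                        # assumed 4 KiB granularity
   map_bytes = (file_size // block) // 8    # one bit per block
   print("map size: %.1f MiB" % (map_bytes / 2**20))   # about 3.1 MiB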
Another way to handle holes is compression, but this is not ideal since Another way to handle holes is compression, but this is not ideal since
it requires all implementations to agree on a single compression it requires all implementations to agree on a single compression
algorithm and requires a fair amount of computational overhead. algorithm and requires a fair amount of computational overhead.
Note that supporting writing to a sparse file does not require Note that supporting writing to a sparse file does not require
changes to the protocol. Applications and/or NFS implementations can changes to the protocol. Applications and/or NFS implementations can
choose to ignore WRITE requests of all zeroes to the NFS server choose to ignore WRITE requests of all zeroes to the NFS server
without consequence. without consequence.
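For instance, a client populating a freshly created file could
suppress all-zero WRITEs as sketched below; write_op is a hypothetical
stand-in for issuing an NFS WRITE, and skipping is only safe where the
target range is known to be unwritten:

   def write_preserving_holes(write_op, offset, data, chunk=64 * 1024):
       # Send only the chunks that contain a non-zero byte; unwritten
       # ranges already read back as zeros on the server.
       for i in range(0, len(data), chunk):
           piece = data[i:i + chunk]
           if piece.count(0) != len(piece):
               write_op(offset + i, piece)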
6.5. Operation 65: READPLUS 6.5. Operation 65: READ_PLUS
This section introduces a new read operation, named READPLUS, which This section introduces a new read operation, named READ_PLUS, which
allows NFS clients to avoid reading holes in a sparse file. READPLUS allows NFS clients to avoid reading holes in a sparse file.
is guaranteed to perform no worse than READ, and can dramatically READ_PLUS is guaranteed to perform no worse than READ, and can
improve performance with sparse files. dramatically improve performance with sparse files.
READPLUS supports all the features of the existing NFSv4.1 READ READ_PLUS supports all the features of the existing NFSv4.1 READ
operation [2] and adds a simple yet significant extension to the operation [2] and adds a simple yet significant extension to the
format of its response. The change allows the server to avoid format of its response. The change allows the server to avoid
returning all zeroes for a file hole, which wastes computational and returning all zeroes for a file hole, which wastes computational and
network resources and reduces performance. READPLUS uses a new network resources and reduces performance. READ_PLUS uses a new
result structure that tells the client that the result is all zeroes result structure that tells the client that the result is all zeroes
AND the byte-range of the hole in which the request was made. AND the byte-range of the hole in which the request was made.
Returning the hole's byte-range, and only upon request, avoids Returning the hole's byte-range, and only upon request, avoids
transferring large Data Region Maps that may be soon invalidated and transferring large Data Region Maps that may be soon invalidated and
contain information about a file that may not even be read in its contain information about a file that may not even be read in its
entirety. entirety.
A new read operation is required due to NFSv4.1 minor versioning A new read operation is required due to NFSv4.1 minor versioning
rules that do not allow modification of an existing operation's rules that do not allow modification of an existing operation's
arguments or results. READPLUS is designed in such a way to allow arguments or results. READ_PLUS is designed in such a way to allow
future extensions to the result structure. The same approach could future extensions to the result structure. The same approach could
be taken to extend the argument structure, but a good use case is be taken to extend the argument structure, but a good use case is
first required to make such a change. first required to make such a change.
6.5.1. ARGUMENT 6.5.1. ARGUMENT
struct COPY_NOTIFY4args { struct READ_PLUS4args {
/* CURRENT_FH: source file */ /* CURRENT_FH: file */
netloc4 cna_destination_server; stateid4 rpa_stateid;
offset4 rpa_offset;
count4 rpa_count;
}; };
6.5.2. RESULT 6.5.2. RESULT
union COPY_NOTIFY4res switch (nfsstat4 cnr_status) { struct read_plus_hole_info {
offset4 rphi_offset;
length4 rphi_length;
};
enum holeres4 {
HOLE_NOINFO = 0,
HOLE_INFO = 1
};
union read_plus_hole switch (holeres4 resop) {
case HOLE_INFO:
read_plus_hole_info rph_info;
case HOLE_NOINFO:
void;
};
enum read_plusrestype4 {
READ_OK = 0,
READ_HOLE = 1
};
union read_plus_data switch (read_plusrestype4 resop) {
case READ_OK:
opaque rpd_data<>;
case READ_HOLE:
read_plus_hole rpd_hole4;
};
struct READ_PLUS4resok {
bool rpr_eof;
read_plus_data rpr_data;
};
union READ_PLUS4res switch (nfsstat4 status) {
case NFS4_OK: case NFS4_OK:
nfstime4 cnr_lease_time; READ_PLUS4resok resok4;
netloc4 cnr_source_server<>;
default: default:
void; void;
}; };
6.5.3. DESCRIPTION 6.5.3. DESCRIPTION
The READPLUS operation is based upon the NFSv4.1 READ operation [2], The READ_PLUS operation is based upon the NFSv4.1 READ operation [2],
and similarly reads data from the regular file identified by the and similarly reads data from the regular file identified by the
current filehandle. current filehandle.
The client provides an offset of where the READPLUS is to start and a The client provides an offset of where the READ_PLUS is to start and
count of how many bytes are to be read. An offset of zero means to a count of how many bytes are to be read. An offset of zero means to
read data starting at the beginning of the file. If offset is read data starting at the beginning of the file. If offset is
greater than or equal to the size of the file, the status NFS4_OK is greater than or equal to the size of the file, the status NFS4_OK is
returned with nfs_readplusrestype4 set to READ_OK, data length set to returned with nfs_readplusrestype4 set to READ_OK, data length set to
zero, and eof set to TRUE. The READPLUS is subject to access zero, and eof set to TRUE. The READ_PLUS is subject to access
permissions checking. permissions checking.
If the client specifies a count value of zero, the READPLUS succeeds If the client specifies a count value of zero, the READ_PLUS succeeds
and returns zero bytes of data, again subject to access permissions and returns zero bytes of data, again subject to access permissions
checking. In all situations, the server may choose to return fewer checking. In all situations, the server may choose to return fewer
bytes than specified by the client. The client needs to check for bytes than specified by the client. The client needs to check for
this condition and handle it appropriately. this condition and handle it appropriately.
If the client specifies an offset and count value that is entirely If the client specifies an offset and count value that is entirely
contained within a hole of the file, the status NFS4_OK is returned contained within a hole of the file, the status NFS4_OK is returned
with nfs_readplusresok4 set to READ_HOLE, and if information is with nfs_readplusresok4 set to READ_HOLE, and if information is
available regarding the hole, a nfs_readplusreshole structure available regarding the hole, a nfs_readplusreshole structure
containing the offset and range of the entire hole. The containing the offset and range of the entire hole. The
nfs_readplusreshole structure is considered valid until the file is nfs_readplusreshole structure is considered valid until the file is
changed (detected via the change attribute). The server MUST provide changed (detected via the change attribute). The server MUST provide
the same semantics for nfs_readplusreshole as if the client read the the same semantics for nfs_readplusreshole as if the client read the
region and received zeroes; the implied hole's contents lifetime MUST region and received zeroes; the implied hole's contents lifetime MUST
be exactly the same as any other read data. be exactly the same as any other read data.
If the client specifies an offset and count value that begins in a If the client specifies an offset and count value that begins in a
non-hole of the file but extends into a hole, the server should return a non-hole of the file but extends into a hole, the server should return a
short read with status NFS4_OK, nfs_readplusresok4 set to READ_OK, short read with status NFS4_OK, nfs_readplusresok4 set to READ_OK,
and data length set to the number of bytes returned. The client will and data length set to the number of bytes returned. The client will
then issue another READPLUS for the remaining bytes, to which the then issue another READ_PLUS for the remaining bytes, to which the
server will respond with information about the hole in the file. server will respond with information about the hole in the file.
If the server knows that the requested byte range falls within a hole of If the server knows that the requested byte range falls within a hole of
the file, but has no further information regarding the hole, it the file, but has no further information regarding the hole, it
returns a nfs_readplusreshole structure with holeres4 set to returns a nfs_readplusreshole structure with holeres4 set to
HOLE_NOINFO. HOLE_NOINFO.
If hole information is available on the server and can be returned to If hole information is available and can be returned to the client,
the client, the server returns a nfs_readplusreshole structure with the server returns a nfs_readplusreshole structure with the value of
the value of holeres4 set to HOLE_INFO. The values of hole_offset and holeres4 set to HOLE_INFO. The values of hole_offset and hole_length
hole_length define the byte-range for the current hole in the file. define the byte-range for the current hole in the file. These values
These values represent the information known to the server and may represent the information known to the server and may describe a
describe a byte-range smaller than the true size of the hole. byte-range smaller than the true size of the hole.
Except when special stateids are used, the stateid value for a Except when special stateids are used, the stateid value for a
READPLUS request represents a value returned from a previous byte- READ_PLUS request represents a value returned from a previous byte-
range lock or share reservation request or the stateid associated range lock or share reservation request or the stateid associated
with a delegation. The stateid identifies the associated owners if with a delegation. The stateid identifies the associated owners if
any and is used by the server to verify that the associated locks are any and is used by the server to verify that the associated locks are
still valid (e.g., have not been revoked). still valid (e.g., have not been revoked).
If the read ended at the end-of-file (formally, in a correctly formed If the read ended at the end-of-file (formally, in a correctly formed
READPLUS operation, if offset + count is equal to the size of the READ_PLUS operation, if offset + count is equal to the size of the
file), or the READPLUS operation extends beyond the size of the file file), or the READ_PLUS operation extends beyond the size of the file
(if offset + count is greater than the size of the file), eof is (if offset + count is greater than the size of the file), eof is
returned as TRUE; otherwise, it is FALSE. A successful READPLUS of returned as TRUE; otherwise, it is FALSE. A successful READ_PLUS of
an empty file will always return eof as TRUE. an empty file will always return eof as TRUE.
If the current filehandle is not an ordinary file, an error will be If the current filehandle is not an ordinary file, an error will be
returned to the client. In the case that the current filehandle returned to the client. In the case that the current filehandle
represents an object of type NF4DIR, NFS4ERR_ISDIR is returned. If represents an object of type NF4DIR, NFS4ERR_ISDIR is returned. If
the current filehandle designates a symbolic link, NFS4ERR_SYMLINK is the current filehandle designates a symbolic link, NFS4ERR_SYMLINK is
returned. In all other cases, NFS4ERR_WRONG_TYPE is returned. returned. In all other cases, NFS4ERR_WRONG_TYPE is returned.
For a READPLUS with a stateid value of all bits equal to zero, the For a READ_PLUS with a stateid value of all bits equal to zero, the
server MAY allow the READPLUS to be serviced subject to mandatory server MAY allow the READ_PLUS to be serviced subject to mandatory
byte-range locks or the current share deny modes for the file. For a byte-range locks or the current share deny modes for the file. For a
READPLUS with a stateid value of all bits equal to one, the server READ_PLUS with a stateid value of all bits equal to one, the server
MAY allow READPLUS operations to bypass locking checks at the server. MAY allow READ_PLUS operations to bypass locking checks at the
server.
On success, the current filehandle retains its value. On success, the current filehandle retains its value.
6.5.4. IMPLEMENTATION 6.5.4. IMPLEMENTATION
If the server returns a "short read" (i.e., less data than requested If the server returns a "short read" (i.e., less data than requested
and eof is set to FALSE), the client should send another READPLUS to and eof is set to FALSE), the client should send another READ_PLUS to
get the remaining data. A server may return less data than requested get the remaining data. A server may return less data than requested
under several circumstances. The file may have been truncated by under several circumstances. The file may have been truncated by
another client or perhaps on the server itself, changing the file another client or perhaps on the server itself, changing the file
size from what the requesting client believes to be the case. This size from what the requesting client believes to be the case. This
would reduce the actual amount of data available to the client. It would reduce the actual amount of data available to the client. It
is possible that the server will reduce the transfer size and so is possible that the server will reduce the transfer size and so
return a short read result. Server resource exhaustion may also return a short read result. Server resource exhaustion may also
result in a short read. result in a short read.
If mandatory byte-range locking is in effect for the file, and if the If mandatory byte-range locking is in effect for the file, and if the
byte-range corresponding to the data to be read from the file is byte-range corresponding to the data to be read from the file is
WRITE_LT locked by an owner not associated with the stateid, the WRITE_LT locked by an owner not associated with the stateid, the
server will return the NFS4ERR_LOCKED error. The client should try server will return the NFS4ERR_LOCKED error. The client should try
to get the appropriate READ_LT via the LOCK operation before re- to get the appropriate READ_LT via the LOCK operation before re-
attempting the READPLUS. When the READPLUS completes, the client attempting the READ_PLUS. When the READ_PLUS completes, the client
should release the byte-range lock via LOCKU. should release the byte-range lock via LOCKU. In addition, the
server MUST return a nfs_readplusreshole structure with values of
hole_offset and hole_length that are within the owner's locked byte
range.
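Clamping the reported hole to the owner's locked byte range is an
interval intersection; a minimal sketch of what the server must
compute:

   def clamp_hole_to_lock(hole_off, hole_len, lock_off, lock_len):
       # Intersect [hole_off, hole_off + hole_len) with the locked range.
       lo = max(hole_off, lock_off)
       hi = min(hole_off + hole_len, lock_off + lock_len)
       return (lo, hi - lo) if lo < hi else None

   # A 224K hole at 32K, clamped to a lock over the first 128K (K = 1000):
   print(clamp_hole_to_lock(32000, 224000, 0, 128000))   # (32000, 96000)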
If another client has an OPEN_DELEGATE_WRITE delegation for the file If another client has an OPEN_DELEGATE_WRITE delegation for the file
being read, the delegation must be recalled, and the operation cannot being read, the delegation must be recalled, and the operation cannot
proceed until that delegation is returned or revoked. Except where proceed until that delegation is returned or revoked. Except where
this happens very quickly, one or more NFS4ERR_DELAY errors will be this happens very quickly, one or more NFS4ERR_DELAY errors will be
returned to requests made while the delegation remains outstanding. returned to requests made while the delegation remains outstanding.
Normally, delegations will not be recalled as a result of a READPLUS Normally, delegations will not be recalled as a result of a READ_PLUS
operation since the recall will occur as a result of an earlier OPEN. operation since the recall will occur as a result of an earlier OPEN.
However, since it is possible for a READPLUS to be done with a However, since it is possible for a READ_PLUS to be done with a
special stateid, the server needs to check for this case even though special stateid, the server needs to check for this case even though
the client should have done an OPEN previously. the client should have done an OPEN previously.
6.5.4.1. Additional pNFS Implementation Information 6.5.4.1. Additional pNFS Implementation Information
With pNFS, the semantics of using READPLUS remain the same. Any With pNFS, the semantics of using READ_PLUS remain the same. Any
data server MAY return a READ_HOLE result for a READPLUS request that data server MAY return a READ_HOLE result for a READ_PLUS request
it receives. that it receives.
When a data server chooses to return a READ_HOLE result, it has a
certain level of flexibility in how it fills out the
nfs_readplusreshole structure.
1. For a data server that cannot determine any hole information, the When a data server chooses to return a READ_HOLE result, it has the
data server SHOULD return HOLE_NOINFO. option of returning hole information for the data stored on that data
server (as defined by the data layout), but it MUST NOT return a
nfs_readplusreshole structure with a byte range that includes data
managed by another data server.
2. For a data server that can only obtain hole information for the 1. Data servers that cannot determine hole information SHOULD return
parts of the file stored on that data server, the data server HOLE_NOINFO.
SHOULD return HOLE_INFO and the byte range of the hole stored on
that data server.
3. For a data server that can obtain hole information for the entire 2. Data servers that can obtain hole information for the parts of
file without severe performance impact, it MAY return HOLE_INFO the file stored on that data server SHOULD return HOLE_INFO and
and the byte range of the entire file hole. the byte range of the hole stored on that data server.
In general, a data server should do its best to return as much A data server should do its best to return as much information about
information about a hole as is feasible. In general, pNFS server a hole as is feasible without having to contact the metadata server.
implementers should try to ensure that data servers do not overload the If communication with the metadata server is required, then every
metadata server with requests for information. Therefore, if attempt should be made to minimize the number of requests.
supplying global sparse information for a file to data servers can
overwhelm a metadata server, then data servers should use option 1 or
2 above.
When a pNFS client receives a READ_HOLE result and a non-empty If mandatory locking is enforced, then the data server must also
nfs_readplusreshole structure, it MAY use this information in ensure that it returns only information for a Hole that is within the
conjunction with a valid layout for the file to determine the next owner's locked byte range.
data server for the next region of data that is not in a hole.
6.5.5. READPLUS with Sparse Files Example 6.5.5. READ_PLUS with Sparse Files Example
To see how the return value READ_HOLE will work, the following table To see how the return value READ_HOLE will work, the following table
describes a sparse file. For each byte range, the file contains describes a sparse file. For each byte range, the file contains
either non-zero data or a hole. either non-zero data or a hole. In addition, the server in this
example uses a hole threshold of 32K.
+-------------+----------+ +-------------+----------+
| Byte-Range | Contents | | Byte-Range | Contents |
+-------------+----------+ +-------------+----------+
| 0-31999 | Non-Zero | | 0-15999 | Hole |
| 16K-31999 | Non-Zero |
| 32K-255999 | Hole | | 32K-255999 | Hole |
| 256K-287999 | Non-Zero | | 256K-287999 | Non-Zero |
| 288K-353999 | Hole | | 288K-353999 | Hole |
| 354K-417999 | Non-Zero | | 354K-417999 | Non-Zero |
+-------------+----------+ +-------------+----------+
Table 3 Table 3
Under the given circumstances, if a client were to read the file from Under the given circumstances, if a client were to read the file from
beginning to end with a max read size of 64K, the following will be beginning to end with a max read size of 64K, the following will be
the result. This assumes the client has already opened the file and the result. This assumes the client has already opened the file and
acquired a valid stateid and just needs to issue READPLUS requests. acquired a valid stateid and just needs to issue READ_PLUS requests.
1. READPLUS(s, 0, 64K) --> NFS_OK, readplusrestype4 = READ_OK, eof = 1. READ_PLUS(s, 0, 64K) --> NFS_OK, readplusrestype4 = READ_OK, eof
false, data<>[32K]. Return a short read, as the last half of the = false, data<>[32K]. Return a short read, as the last half of
request was all zeroes. the request was all zeroes. Note that the first hole is read
back as all zeros as it is below the hole threshold.
2. READPLUS(s, 32K, 64K) --> NFS_OK, readplusrestype4 = READ_HOLE, 2. READ_PLUS(s, 32K, 64K) --> NFS_OK, readplusrestype4 = READ_HOLE,
nfs_readplusreshole(HOLE_INFO)(32K, 224K). The requested range nfs_readplusreshole(HOLE_INFO)(32K, 224K). The requested range
was all zeros, and the current hole begins at offset 32K and is was all zeros, and the current hole begins at offset 32K and is
224K in length. 224K in length.
3. READPLUS(s, 256K, 64K) --> NFS_OK, readplusrestype4 = READ_OK, 3. READ_PLUS(s, 256K, 64K) --> NFS_OK, readplusrestype4 = READ_OK,
eof = false, data<>[32K]. Return a short read, as the last half eof = false, data<>[32K]. Return a short read, as the last half
of the request was all zeroes. of the request was all zeroes.
4. READPLUS(s, 288K, 64K) --> NFS_OK, readplusrestype4 = READ_HOLE, 4. READ_PLUS(s, 288K, 64K) --> NFS_OK, readplusrestype4 = READ_HOLE,
nfs_readplusreshole(HOLE_INFO)(288K, 66K). nfs_readplusreshole(HOLE_INFO)(288K, 66K).
5. READPLUS(s, 354K, 64K) --> NFS_OK, readplusrestype4 = READ_OK, 5. READ_PLUS(s, 354K, 64K) --> NFS_OK, readplusrestype4 = READ_OK,
eof = true, data<>[64K]. eof = true, data<>[64K].
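The exchange above can be reproduced with a small simulation. The
table's decimal boundaries suggest K = 1000 here, and the reply is
reduced to the fields that matter; both are simplifications of this
sketch, not protocol statements:

   K = 1000
   HOLE_THRESHOLD = 32 * K
   FILE_SIZE = 418 * K
   LAYOUT = [(0, 16 * K, True), (16 * K, 16 * K, False),  # (off, len, hole?)
             (32 * K, 224 * K, True), (256 * K, 32 * K, False),
             (288 * K, 66 * K, True), (354 * K, 64 * K, False)]

   def read_plus(offset, count):
       # Returns ("READ_HOLE", hole_off, hole_len) or
       # ("READ_OK", bytes_returned, eof), per the rules in Section 6.5.3.
       count = min(count, FILE_SIZE - offset)
       pos = offset
       while pos < offset + count:
           off, length, hole = next(e for e in LAYOUT
                                    if e[0] <= pos < e[0] + e[1])
           if hole and length >= HOLE_THRESHOLD:
               if pos == offset:
                   return ("READ_HOLE", off, length)  # begins in a hole
               break                                  # short read
           pos = min(off + length, offset + count)    # small holes: data
       return ("READ_OK", pos - offset, pos >= FILE_SIZE)

   for off in (0, 32 * K, 256 * K, 288 * K, 354 * K):
       print(read_plus(off, 64 * K))   # reproduces steps 1 through 5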
6.6. Related Work 6.6. Related Work
Solaris and ZFS support an extension to lseek(2) that allows Solaris and ZFS support an extension to lseek(2) that allows
applications to discover holes in a file. The values, SEEK_HOLE and applications to discover holes in a file. The values, SEEK_HOLE and
SEEK_DATA, allow clients to seek to the next hole or beginning of SEEK_DATA, allow clients to seek to the next hole or beginning of
data, respectively. data, respectively.
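On platforms that expose this extension, the hole map of a file can be
walked from Python with os.lseek; a sketch:

   import os

   def holes(path):
       # Yield (hole_offset, hole_length) pairs using SEEK_HOLE/SEEK_DATA
       # where the OS and file system support them.
       fd = os.open(path, os.O_RDONLY)
       try:
           size = os.fstat(fd).st_size
           pos = 0
           while pos < size:
               hole = os.lseek(fd, pos, os.SEEK_HOLE)
               if hole >= size:
                   break                      # only the implicit EOF hole
               try:
                   data = os.lseek(fd, hole, os.SEEK_DATA)
               except OSError:                # hole runs to end of file
                   data = size
               yield hole, data - hole
               pos = data
       finally:
           os.close(fd)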
XFS supports the XFS_IOC_GETBMAP ioctl, which returns XFS supports the XFS_IOC_GETBMAP ioctl, which returns
the Data Region Map for a file. Clients can then use this the Data Region Map for a file. Clients can then use this
information to avoid reading holes in a file. information to avoid reading holes in a file.
NTFS and CIFS support the FSCTL_SET_SPARSE attribute, which allows NTFS and CIFS support the FSCTL_SET_SPARSE attribute, which allows
applications to control whether empty regions of the file are applications to control whether empty regions of the file are
preallocated and filled in with zeros or simply left unallocated. preallocated and filled in with zeros or simply left unallocated.
6.7. Security Considerations 6.7. Other Proposed Designs
The additions to the NFS protocol for supporting sparse file reads 6.7.1. Multi-Data Server Hole Information
does not alter the security considerations of the NFSv4.1 protocol
[2].
6.8. IANA Considerations The current design prohibits pnfs data servers from returning hole
information for regions of a file that are not stored on that data
server. Having data servers return information regarding other data
servers changes the fundamental principle that all metadata
information comes from the metadata server.
There are no IANA considerations in this section. Here is a brief description of how multi-data
server hole information could be supported:
For a data server that can obtain hole information for the entire
file without severe performance impact, it MAY return HOLE_INFO and
the byte range of the entire file hole. When a pNFS client receives
a READ_HOLE result and a non-empty nfs_readplusreshole structure, it
MAY use this information in conjunction with a valid layout for the
file to determine the next data server for the next region of data
that is not in a hole.
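Under that dropped design, a client with a simple striped file layout
could turn whole-file hole information into a choice of data server;
the striping parameters below are illustrative only:

   def ds_for_offset(offset, stripe_unit=64 * 1024, num_ds=4):
       # Round-robin (sparse) striping: offset -> data server index.
       return (offset // stripe_unit) % num_ds

   hole_offset, hole_length = 32000, 224000    # from a READ_HOLE reply
   next_data = hole_offset + hole_length
   print("resume at offset %d on data server %d"
         % (next_data, ds_for_offset(next_data)))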
6.7.2. Data Result Array
If a single read request contains one or more Holes with a length
greater than the Hole Threshold, the current design would return
results indicating a short read to the client. A client would then
send a series of read requests to the server to retrieve information
for the Holes and the remaining data. To avoid turning a single read
request into several exchanges between the client and server, the
server may need to choose a relatively large Hole Threshold in
order to decrease the number of short reads it creates. A large
Hole Threshold may miss many smaller holes, which in turn may
negate the benefits of sparse read support.
To avoid this situation, one option is to have the READ_PLUS
operation return information for multiple holes in a single return
value. This would allow several small holes to be described in a
single read response without requiring multiple exchanges between
the client and server.
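One shape such a result could take (a sketch, not a proposed XDR
encoding) is a list of typed segments that the client replays in
order:

   # Each segment is ("data", offset, payload) or ("hole", offset, length).
   def flatten(segments):
       # Reassemble the byte stream a plain READ would have returned.
       out = bytearray()
       for kind, _off, val in segments:
           out += val if kind == "data" else b"\0" * val
       return bytes(out)

   reply = [("data", 0, b"\x01" * 8), ("hole", 8, 16),
            ("data", 24, b"\x02" * 8)]
   assert flatten(reply) == b"\x01" * 8 + b"\0" * 16 + b"\x02" * 8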
One important item to consider with returning an array of data chunks
is its impact on RDMA, which may use different block sizes on the
client and server (among other things).
6.7.3. User-Defined Sparse Mask
Add a mask (instead of just zeroes). Should it be specified by the server or the client?
6.7.4. Allocated flag
A Hole on the server may be an allocated byte-range consisting of all
zeroes or may not be allocated at all. To ensure this information is
properly communicated to the client, it may be beneficial to add an
'alloc' flag to the HOLE_INFO section of nfs_readplusreshole. This
would allow an NFS client to copy a file from one file system to
another and have it more closely resemble the original.
6.7.5. Dense and Sparse pNFS File Layouts
The hole information returned from a data server must be understood The hole information returned from a data server must be understood
by pNFS clients using both Dense and Sparse file layout types. Does by pNFS clients using both Dense and Sparse file layout types. Does
the current READ_PLUS return value work for both layout types? Does
the data server know if it is using dense or sparse so that it can
return the correct hole_offset and hole_length values?
7. Security Considerations 7. Security Considerations
8. IANA Considerations 8. IANA Considerations
This section uses terms that are defined in [17]. This section uses terms that are defined in [17].
9. References 9. References
9.1. Normative References 9.1. Normative References
[1] Bradner, S., "Key words for use in RFCs to Indicate Requirement [1] Bradner, S., "Key words for use in RFCs to Indicate Requirement
Levels", March 1997. Levels", March 1997.
[2] Shepler, S., Eisler, M., and D. Noveck, "Network File System [2] Shepler, S., Eisler, M., and D. Noveck, "Network File System
(NFS) Version 4 Minor Version 1 Protocol", RFC 5661, (NFS) Version 4 Minor Version 1 Protocol", RFC 5661,
January 2010. January 2010.
[3] Black, D., Glasgow, J., and S. Fridella, "Parallel NFS (pNFS) [3] Haynes, T., "Network File System (NFS) Version 4 Minor Version
Block/Volume Layout", RFC 5663, January 2010. 2 External Data Representation Standard (XDR) Description",
March 2011.
[4] Halevy, B., Welch, B., and J. Zelenka, "Object-Based Parallel [4] Halevy, B., Welch, B., and J. Zelenka, "Object-Based Parallel
NFS (pNFS) Operations", RFC 5664, January 2010. NFS (pNFS) Operations", RFC 5664, January 2010.
[5] Berners-Lee, T., Fielding, R., and L. Masinter, "Uniform [5] Berners-Lee, T., Fielding, R., and L. Masinter, "Uniform
Resource Identifier (URI): Generic Syntax", STD 66, RFC 3986, Resource Identifier (URI): Generic Syntax", STD 66, RFC 3986,
January 2005. January 2005.
[6] Williams, N., "Remote Procedure Call (RPC) Security Version 3", [6] Williams, N., "Remote Procedure Call (RPC) Security Version 3",
draft-williams-rpcsecgssv3 (work in progress), 2008. draft-williams-rpcsecgssv3 (work in progress), 2008.
[7] Shepler, S., Eisler, M., and D. Noveck, "Network File System [7] Shepler, S., Eisler, M., and D. Noveck, "Network File System
(NFS) Version 4 Minor Version 1 External Data Representation (NFS) Version 4 Minor Version 1 External Data Representation
Standard (XDR) Description", RFC 5662, January 2010. Standard (XDR) Description", RFC 5662, January 2010.
[8] Haynes, T., "Network File System (NFS) Version 4 Minor Version [8] Black, D., Glasgow, J., and S. Fridella, "Parallel NFS (pNFS)
2 External Data Representation Standard (XDR) Description", Block/Volume Layout", RFC 5663, January 2010.
April 2011.
[9] Eisler, M., Chiu, A., and L. Ling, "RPCSEC_GSS Protocol [9] Eisler, M., Chiu, A., and L. Ling, "RPCSEC_GSS Protocol
Specification", RFC 2203, September 1997. Specification", RFC 2203, September 1997.
9.2. Informative References 9.2. Informative References
[10] Haynes, T. and D. Noveck, "Network File System (NFS) version 4 [10] Haynes, T. and D. Noveck, "Network File System (NFS) version 4
Protocol", draft-ietf-nfsv4-rfc3530bis-09 (Work In Progress), Protocol", draft-ietf-nfsv4-rfc3530bis-09 (Work In Progress),
April 2011. March 2011.
[11] Eisler, M., "XDR: External Data Representation Standard", [11] Eisler, M., "XDR: External Data Representation Standard",
RFC 4506, May 2006. RFC 4506, May 2006.
[12] Lentini, J., Everhart, C., Ellard, D., Tewari, R., and M. Naik, [12] Lentini, J., Everhart, C., Ellard, D., Tewari, R., and M. Naik,
"NSDB Protocol for Federated Filesystems", "NSDB Protocol for Federated Filesystems",
draft-ietf-nfsv4-federated-fs-protocol (Work In Progress), draft-ietf-nfsv4-federated-fs-protocol (Work In Progress),
2010. 2010.
[13] Lentini, J., Everhart, C., Ellard, D., Tewari, R., and M. Naik, [13] Lentini, J., Everhart, C., Ellard, D., Tewari, R., and M. Naik,
skipping to change at page 63, line 9 skipping to change at page 62, line 33
[24] Reynolds, J., "Assigned Numbers: RFC 1700 is Replaced by an On- [24] Reynolds, J., "Assigned Numbers: RFC 1700 is Replaced by an On-
line Database", RFC 3232, January 2002. line Database", RFC 3232, January 2002.
[25] Linn, J., "The Kerberos Version 5 GSS-API Mechanism", RFC 1964, [25] Linn, J., "The Kerberos Version 5 GSS-API Mechanism", RFC 1964,
June 1996. June 1996.
[26] Shepler, S., Callaghan, B., Robinson, D., Thurlow, R., Beame, [26] Shepler, S., Callaghan, B., Robinson, D., Thurlow, R., Beame,
C., Eisler, M., and D. Noveck, "Network File System (NFS) C., Eisler, M., and D. Noveck, "Network File System (NFS)
version 4 Protocol", RFC 3530, April 2003. version 4 Protocol", RFC 3530, April 2003.
[27] Strohm, R., "Chapter 2, Data Blocks, Extents, and Segments, of
Oracle Database Concepts 11g Release 1 (11.1)", January 2011.
Appendix A. Acknowledgments Appendix A. Acknowledgments
For the pNFS Access Permissions Check, the original draft was by For the pNFS Access Permissions Check, the original draft was by
Sorin Faibish, David Black, Mike Eisler, and Jason Glasgow. The work Sorin Faibish, David Black, Mike Eisler, and Jason Glasgow. The work
was influenced by discussions with Benny Halevy and Bruce Fields. A was influenced by discussions with Benny Halevy and Bruce Fields. A
review was done by Tom Haynes. review was done by Tom Haynes.
For the Sharing change attribute implementation details with NFSv4 For the Sharing change attribute implementation details with NFSv4
clients, the original draft was by Trond Myklebust. clients, the original draft was by Trond Myklebust.
 End of changes. 118 change blocks. 
523 lines changed or deleted 445 lines changed or added
