draft-ietf-nfsv4-minorversion2-12.txt   draft-ietf-nfsv4-minorversion2-13.txt 
NFSv4 T. Haynes NFSv4 T. Haynes
Internet-Draft Editor Internet-Draft Editor
Intended status: Standards Track June 20, 2012 Intended status: Standards Track July 11, 2012
Expires: December 22, 2012 Expires: January 12, 2013
NFS Version 4 Minor Version 2 NFS Version 4 Minor Version 2
draft-ietf-nfsv4-minorversion2-12.txt draft-ietf-nfsv4-minorversion2-13.txt
Abstract Abstract
This Internet-Draft describes NFS version 4 minor version two, This Internet-Draft describes NFS version 4 minor version two,
focusing mainly on the protocol extensions made from NFS version 4 focusing mainly on the protocol extensions made from NFS version 4
minor version 0 and NFS version 4 minor version 1. Major extensions minor version 0 and NFS version 4 minor version 1. Major extensions
introduced in NFS version 4 minor version two include: Server-side introduced in NFS version 4 minor version two include: Server-side
Copy, Application I/O Advise, Space Reservations, Sparse Files, Copy, Application I/O Advise, Space Reservations, Sparse Files,
Application Data Blocks, and Labeled NFS. Application Data Blocks, and Labeled NFS.
skipping to change at page 1, line 41 skipping to change at page 1, line 41
Internet-Drafts are working documents of the Internet Engineering Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF). Note that other groups may also distribute Task Force (IETF). Note that other groups may also distribute
working documents as Internet-Drafts. The list of current Internet- working documents as Internet-Drafts. The list of current Internet-
Drafts is at http://datatracker.ietf.org/drafts/current/. Drafts is at http://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress." material or to cite them other than as "work in progress."
This Internet-Draft will expire on December 22, 2012. This Internet-Draft will expire on January 12, 2013.
Copyright Notice Copyright Notice
Copyright (c) 2012 IETF Trust and the persons identified as the Copyright (c) 2012 IETF Trust and the persons identified as the
document authors. All rights reserved. document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents Provisions Relating to IETF Documents
(http://trustee.ietf.org/license-info) in effect on the date of (http://trustee.ietf.org/license-info) in effect on the date of
publication of this document. Please review these documents publication of this document. Please review these documents
skipping to change at page 3, line 7 skipping to change at page 3, line 7
modifications of such material outside the IETF Standards Process. modifications of such material outside the IETF Standards Process.
Without obtaining an adequate license from the person(s) controlling Without obtaining an adequate license from the person(s) controlling
the copyright in such materials, this document may not be modified the copyright in such materials, this document may not be modified
outside the IETF Standards Process, and derivative works of it may outside the IETF Standards Process, and derivative works of it may
not be created outside the IETF Standards Process, except to format not be created outside the IETF Standards Process, except to format
it for publication as an RFC or to translate it into languages other it for publication as an RFC or to translate it into languages other
than English. than English.
Table of Contents Table of Contents
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 6 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.1. The NFS Version 4 Minor Version 2 Protocol . . . . . . . 6 1.1. The NFS Version 4 Minor Version 2 Protocol . . . . . . . 5
1.2. Scope of This Document . . . . . . . . . . . . . . . . . 6 1.2. Scope of This Document . . . . . . . . . . . . . . . . . 5
1.3. NFSv4.2 Goals . . . . . . . . . . . . . . . . . . . . . . 6 1.3. NFSv4.2 Goals . . . . . . . . . . . . . . . . . . . . . . 5
1.4. Overview of NFSv4.2 Features . . . . . . . . . . . . . . 7 1.4. Overview of NFSv4.2 Features . . . . . . . . . . . . . . 6
1.4.1. Sparse Files . . . . . . . . . . . . . . . . . . . . . 7 1.4.1. Sparse Files . . . . . . . . . . . . . . . . . . . . . 6
1.4.2. Application I/O Advise . . . . . . . . . . . . . . . . 7 1.4.2. Application I/O Advise . . . . . . . . . . . . . . . . 6
1.5. Differences from NFSv4.1 . . . . . . . . . . . . . . . . 7 1.5. Differences from NFSv4.1 . . . . . . . . . . . . . . . . 6
2. NFS Server-side Copy . . . . . . . . . . . . . . . . . . . . . 7 2. NFS Server-side Copy . . . . . . . . . . . . . . . . . . . . . 6
2.1. Introduction . . . . . . . . . . . . . . . . . . . . . . 7 2.1. Introduction . . . . . . . . . . . . . . . . . . . . . . 6
2.2. Protocol Overview . . . . . . . . . . . . . . . . . . . . 8 2.2. Protocol Overview . . . . . . . . . . . . . . . . . . . . 7
2.2.1. Overview of Copy Operations . . . . . . . . . . . . . 9 2.2.1. Overview of Copy Operations . . . . . . . . . . . . . 7
2.2.2. Intra-Server Copy . . . . . . . . . . . . . . . . . . 9 2.2.2. Intra-Server Copy . . . . . . . . . . . . . . . . . . 8
2.2.3. Inter-Server Copy . . . . . . . . . . . . . . . . . . 10 2.2.3. Inter-Server Copy . . . . . . . . . . . . . . . . . . 9
2.2.4. Server-to-Server Copy Protocol . . . . . . . . . . . . 13 2.2.4. Server-to-Server Copy Protocol . . . . . . . . . . . . 12
2.3. Requirements for Operations . . . . . . . . . . . . . . . 15 2.3. Requirements for Operations . . . . . . . . . . . . . . . 14
2.3.1. netloc4 - Network Locations . . . . . . . . . . . . . 15 2.3.1. netloc4 - Network Locations . . . . . . . . . . . . . 14
2.3.2. Copy Offload Stateids . . . . . . . . . . . . . . . . 16 2.3.2. Copy Offload Stateids . . . . . . . . . . . . . . . . 15
2.4. Security Considerations . . . . . . . . . . . . . . . . . 17 2.4. Security Considerations . . . . . . . . . . . . . . . . . 16
2.4.1. Inter-Server Copy Security . . . . . . . . . . . . . . 17 2.4.1. Inter-Server Copy Security . . . . . . . . . . . . . . 16
3. Support for Application IO Hints . . . . . . . . . . . . . . . 25 3. Support for Application IO Hints . . . . . . . . . . . . . . . 24
3.1. Introduction . . . . . . . . . . . . . . . . . . . . . . 25 4. Sparse Files . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.2. POSIX Requirements . . . . . . . . . . . . . . . . . . . 26 4.1. Introduction . . . . . . . . . . . . . . . . . . . . . . 24
3.3. Additional Requirements . . . . . . . . . . . . . . . . . 27 4.2. Terminology . . . . . . . . . . . . . . . . . . . . . . . 25
3.4. Security Considerations . . . . . . . . . . . . . . . . . 28 5. Space Reservation . . . . . . . . . . . . . . . . . . . . . . 25
3.5. IANA Considerations . . . . . . . . . . . . . . . . . . . 28 5.1. Introduction . . . . . . . . . . . . . . . . . . . . . . 26
4. Sparse Files . . . . . . . . . . . . . . . . . . . . . . . . . 28 6. Application Data Block Support . . . . . . . . . . . . . . . . 28
4.1. Introduction . . . . . . . . . . . . . . . . . . . . . . 28 6.1. Generic Framework . . . . . . . . . . . . . . . . . . . . 29
4.2. Terminology . . . . . . . . . . . . . . . . . . . . . . . 29 6.1.1. Data Block Representation . . . . . . . . . . . . . . 29
5. Space Reservation . . . . . . . . . . . . . . . . . . . . . . 29 6.1.2. Data Content . . . . . . . . . . . . . . . . . . . . . 30
5.1. Introduction . . . . . . . . . . . . . . . . . . . . . . 29 6.2. pNFS Considerations . . . . . . . . . . . . . . . . . . . 30
6. Application Data Block Support . . . . . . . . . . . . . . . . 31 6.3. An Example of Detecting Corruption . . . . . . . . . . . 30
6.1. Generic Framework . . . . . . . . . . . . . . . . . . . . 32 6.4. Example of READ_PLUS . . . . . . . . . . . . . . . . . . 32
6.1.1. Data Block Representation . . . . . . . . . . . . . . 33 6.5. Zero Filled Holes . . . . . . . . . . . . . . . . . . . . 32
6.1.2. Data Content . . . . . . . . . . . . . . . . . . . . . 33 7. Labeled NFS . . . . . . . . . . . . . . . . . . . . . . . . . 32
6.2. pNFS Considerations . . . . . . . . . . . . . . . . . . . 33 7.1. Introduction . . . . . . . . . . . . . . . . . . . . . . 33
6.3. An Example of Detecting Corruption . . . . . . . . . . . 34 7.2. Definitions . . . . . . . . . . . . . . . . . . . . . . . 34
6.4. Example of READ_PLUS . . . . . . . . . . . . . . . . . . 35 7.3. MAC Security Attribute . . . . . . . . . . . . . . . . . 34
6.5. Zero Filled Holes . . . . . . . . . . . . . . . . . . . . 36 7.3.1. Delegations . . . . . . . . . . . . . . . . . . . . . 35
7. Labeled NFS . . . . . . . . . . . . . . . . . . . . . . . . . 36 7.3.2. Permission Checking . . . . . . . . . . . . . . . . . 35
7.1. Introduction . . . . . . . . . . . . . . . . . . . . . . 36 7.3.3. Object Creation . . . . . . . . . . . . . . . . . . . 35
7.2. Definitions . . . . . . . . . . . . . . . . . . . . . . . 37 7.3.4. Existing Objects . . . . . . . . . . . . . . . . . . . 36
7.3. MAC Security Attribute . . . . . . . . . . . . . . . . . 38 7.3.5. Label Changes . . . . . . . . . . . . . . . . . . . . 36
7.3.1. Delegations . . . . . . . . . . . . . . . . . . . . . 38 7.4. pNFS Considerations . . . . . . . . . . . . . . . . . . . 37
7.3.2. Permission Checking . . . . . . . . . . . . . . . . . 39 7.5. Discovery of Server Labeled NFS Support . . . . . . . . . 37
7.3.3. Object Creation . . . . . . . . . . . . . . . . . . . 39 7.6. MAC Security NFS Modes of Operation . . . . . . . . . . . 37
7.3.4. Existing Objects . . . . . . . . . . . . . . . . . . . 39 7.6.1. Full Mode . . . . . . . . . . . . . . . . . . . . . . 38
7.3.5. Label Changes . . . . . . . . . . . . . . . . . . . . 39 7.6.2. Guest Mode . . . . . . . . . . . . . . . . . . . . . . 39
7.4. pNFS Considerations . . . . . . . . . . . . . . . . . . . 40 7.7. Security Considerations . . . . . . . . . . . . . . . . . 39
7.5. Discovery of Server Labeled NFS Support . . . . . . . . . 40
7.6. MAC Security NFS Modes of Operation . . . . . . . . . . . 41
7.6.1. Full Mode . . . . . . . . . . . . . . . . . . . . . . 41
7.6.2. Guest Mode . . . . . . . . . . . . . . . . . . . . . . 42
7.7. Security Considerations . . . . . . . . . . . . . . . . . 43
8. Sharing change attribute implementation details with NFSv4 8. Sharing change attribute implementation details with NFSv4
clients . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 clients . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
8.1. Introduction . . . . . . . . . . . . . . . . . . . . . . 43 8.1. Introduction . . . . . . . . . . . . . . . . . . . . . . 40
9. Security Considerations . . . . . . . . . . . . . . . . . . . 44 9. Security Considerations . . . . . . . . . . . . . . . . . . . 40
10. Error Values . . . . . . . . . . . . . . . . . . . . . . . . . 44 10. Error Values . . . . . . . . . . . . . . . . . . . . . . . . . 40
10.1. Error Definitions . . . . . . . . . . . . . . . . . . . . 44 10.1. Error Definitions . . . . . . . . . . . . . . . . . . . . 41
10.1.1. General Errors . . . . . . . . . . . . . . . . . . . . 44 10.1.1. General Errors . . . . . . . . . . . . . . . . . . . . 41
10.1.2. Server to Server Copy Errors . . . . . . . . . . . . . 45 10.1.2. Server to Server Copy Errors . . . . . . . . . . . . . 41
10.1.3. Labeled NFS Errors . . . . . . . . . . . . . . . . . . 45 10.1.3. Labeled NFS Errors . . . . . . . . . . . . . . . . . . 42
11. New File Attributes . . . . . . . . . . . . . . . . . . . . . 46 11. New File Attributes . . . . . . . . . . . . . . . . . . . . . 42
11.1. New RECOMMENDED Attributes - List and Definition 11.1. New RECOMMENDED Attributes - List and Definition
References . . . . . . . . . . . . . . . . . . . . . . . 46 References . . . . . . . . . . . . . . . . . . . . . . . 42
11.2. Attribute Definitions . . . . . . . . . . . . . . . . . . 46 11.2. Attribute Definitions . . . . . . . . . . . . . . . . . . 43
12. Operations: REQUIRED, RECOMMENDED, or OPTIONAL . . . . . . . . 49 12. Operations: REQUIRED, RECOMMENDED, or OPTIONAL . . . . . . . . 46
13. NFSv4.2 Operations . . . . . . . . . . . . . . . . . . . . . . 53 13. NFSv4.2 Operations . . . . . . . . . . . . . . . . . . . . . . 49
13.1. Operation 59: COPY - Initiate a server-side copy . . . . 53 13.1. Operation 59: COPY - Initiate a server-side copy . . . . 49
13.2. Operation 60: COPY_ABORT - Cancel a server-side copy . . 61 13.2. Operation 60: COPY_ABORT - Cancel a server-side copy . . 57
13.3. Operation 61: COPY_NOTIFY - Notify a source server of 13.3. Operation 61: COPY_NOTIFY - Notify a source server of
a future copy . . . . . . . . . . . . . . . . . . . . . . 62 a future copy . . . . . . . . . . . . . . . . . . . . . . 58
13.4. Operation 62: COPY_REVOKE - Revoke a destination 13.4. Operation 62: COPY_REVOKE - Revoke a destination
server's copy privileges . . . . . . . . . . . . . . . . 63 server's copy privileges . . . . . . . . . . . . . . . . 59
13.5. Operation 63: COPY_STATUS - Poll for status of a 13.5. Operation 63: COPY_STATUS - Poll for status of a
server-side copy . . . . . . . . . . . . . . . . . . . . 64 server-side copy . . . . . . . . . . . . . . . . . . . . 60
13.6. Modification to Operation 42: EXCHANGE_ID - 13.6. Modification to Operation 42: EXCHANGE_ID -
Instantiate Client ID . . . . . . . . . . . . . . . . . . 66 Instantiate Client ID . . . . . . . . . . . . . . . . . . 62
13.7. Operation 64: INITIALIZE . . . . . . . . . . . . . . . . 67 13.7. Operation 64: INITIALIZE . . . . . . . . . . . . . . . . 63
13.8. Operation 67: IO_ADVISE - Application I/O access 13.8. Operation 67: IO_ADVISE - Application I/O access
pattern hints . . . . . . . . . . . . . . . . . . . . . . 70 pattern hints . . . . . . . . . . . . . . . . . . . . . . 66
13.9. Changes to Operation 51: LAYOUTRETURN . . . . . . . . . . 76 13.9. Changes to Operation 51: LAYOUTRETURN . . . . . . . . . . 72
13.10. Operation 65: READ_PLUS . . . . . . . . . . . . . . . . . 79 13.10. Operation 65: READ_PLUS . . . . . . . . . . . . . . . . . 75
13.11. Operation 66: SEEK . . . . . . . . . . . . . . . . . . . 84 13.11. Operation 66: SEEK . . . . . . . . . . . . . . . . . . . 80
14. NFSv4.2 Callback Operations . . . . . . . . . . . . . . . . . 85 14. NFSv4.2 Callback Operations . . . . . . . . . . . . . . . . . 81
14.1. Procedure 16: CB_ATTR_CHANGED - Notify Client that 14.1. Procedure 16: CB_ATTR_CHANGED - Notify Client that
the File's Attributes Changed . . . . . . . . . . . . . . 85 the File's Attributes Changed . . . . . . . . . . . . . . 81
14.2. Operation 15: CB_COPY - Report results of a 14.2. Operation 15: CB_COPY - Report results of a
server-side copy . . . . . . . . . . . . . . . . . . . . 86 server-side copy . . . . . . . . . . . . . . . . . . . . 82
15. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 88 15. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 83
16. References . . . . . . . . . . . . . . . . . . . . . . . . . . 88 16. References . . . . . . . . . . . . . . . . . . . . . . . . . . 83
16.1. Normative References . . . . . . . . . . . . . . . . . . 88 16.1. Normative References . . . . . . . . . . . . . . . . . . 83
16.2. Informative References . . . . . . . . . . . . . . . . . 89 16.2. Informative References . . . . . . . . . . . . . . . . . 84
Appendix A. Acknowledgments . . . . . . . . . . . . . . . . . . . 85
Appendix A. Acknowledgments . . . . . . . . . . . . . . . . . . . 90 Appendix B. RFC Editor Notes . . . . . . . . . . . . . . . . . . 86
Appendix B. RFC Editor Notes . . . . . . . . . . . . . . . . . . 91 Author's Address . . . . . . . . . . . . . . . . . . . . . . . . . 86
Author's Address . . . . . . . . . . . . . . . . . . . . . . . . . 91
1. Introduction 1. Introduction
1.1. The NFS Version 4 Minor Version 2 Protocol 1.1. The NFS Version 4 Minor Version 2 Protocol
The NFS version 4 minor version 2 (NFSv4.2) protocol is the third The NFS version 4 minor version 2 (NFSv4.2) protocol is the third
minor version of the NFS version 4 (NFSv4) protocol. The first minor minor version of the NFS version 4 (NFSv4) protocol. The first minor
version, NFSv4.0, is described in [10] and the second minor version, version, NFSv4.0, is described in [10] and the second minor version,
NFSv4.1, is described in [2]. It follows the guidelines for minor NFSv4.1, is described in [2]. It follows the guidelines for minor
versioning that are listed in Section 11 of [10]. versioning that are listed in Section 11 of [10].
skipping to change at page 7, line 47 skipping to change at page 6, line 47
guidelines for minor versioning as operations in NFSv4.1 - i.e., they guidelines for minor versioning as operations in NFSv4.1 - i.e., they
may not be made REQUIRED. To support this, a new error code, may not be made REQUIRED. To support this, a new error code,
NFS4ERR_UNION_NOTSUPP, is introduced which allows the server to NFS4ERR_UNION_NOTSUPP, is introduced which allows the server to
communicate to the client that the operation is supported, but the communicate to the client that the operation is supported, but the
specific arm of the discriminated union is not. specific arm of the discriminated union is not.
2. NFS Server-side Copy 2. NFS Server-side Copy
2.1. Introduction 2.1. Introduction
This section describes a server-side copy feature for the NFS
protocol.
The server-side copy feature provides a mechanism for the NFS client The server-side copy feature provides a mechanism for the NFS client
to perform a file copy on the server without the data being to perform a file copy on the server without the data being
transmitted back and forth over the network. transmitted back and forth over the network. Without this feature,
an NFS client copies data from one location to another by reading the
Without this feature, an NFS client copies data from one location to data from the server over the network, and then writing the data back
another by reading the data from the server over the network, and over the network to the server. Using this server-side copy
then writing the data back over the network to the server. Using operation, the client is able to instruct the server to copy the data
this server-side copy operation, the client is able to instruct the locally without the data being sent back and forth over the network
server to copy the data locally without the data being sent back and unnecessarily.
forth over the network unnecessarily.
If the source object and destination object are on different file If the source object and destination object are on different file
servers, the file servers will communicate with one another to servers, the file servers will communicate with one another to
perform the copy operation. The server-to-server protocol by which perform the copy operation. The server-to-server protocol by which
this is accomplished is not defined in this document. this is accomplished is not defined in this document.
2.2. Protocol Overview 2.2. Protocol Overview
The server-side copy offload operations support both intra-server and The server-side copy offload operations support both intra-server and
inter-server file copies. An intra-server copy is a copy in which inter-server file copies. An intra-server copy is a copy in which
skipping to change at page 25, line 27 skipping to change at page 24, line 27
block these attacks. block these attacks.
2.4.1.4. Inter-Server Copy without ONC RPC and RPCSEC_GSSv3 2.4.1.4. Inter-Server Copy without ONC RPC and RPCSEC_GSSv3
The same techniques as Section 2.4.1.3, using unique URLs for each The same techniques as Section 2.4.1.3, using unique URLs for each
destination server, can be used for other protocols (e.g., HTTP [13] destination server, can be used for other protocols (e.g., HTTP [13]
and FTP [14]) as well. and FTP [14]) as well.
3. Support for Application IO Hints 3. Support for Application IO Hints
3.1. Introduction Applications can issue client I/O hints via posix_fadvise() [6] to
the NFS client. While this can help the NFS client optimize I/O and
Applications currently have several options for communicating I/O caching for a file, it does not allow the NFS server and its exported
access patterns to the NFS client. While this can help the NFS file system to do likewise. We add an IO_ADVISE procedure
client optimize I/O and caching for a file, it does not allow the NFS (Section 13.8) to communicate the client file access patterns to the
server and its exported file system to do likewise. Therefore, here NFS server. The NFS server upon receiving an IO_ADVISE operation MAY
we put forth a proposal for the NFSv4.2 protocol to allow choose to alter its I/O and caching behavior, but is under no
applications to communicate their expected behavior to the server. obligation to do so.
By communicating expected access pattern, e.g., sequential or random,
and data re-use behavior, e.g., data range will be read multiple
times and should be cached, the server will be able to better
understand what optimizations it should implement for access to a
file. For example, if an application indicates it will never read the
data more than once, then the file system can avoid polluting the
data cache and not cache the data.
The first application that can issue client I/O hints is the
posix_fadvise operation. For example, on Linux, when an application
uses posix_fadvise to specify a file will be read sequentially, Linux
doubles the readahead buffer size.
Another instance where applications provide an indication of their
desired I/O behavior is the use of direct I/O. By specifying direct
I/O, clients will no longer cache data, but this information is not
passed to the server, which will continue caching data.
Application specific NFS clients such as those used by hypervisors Application specific NFS clients such as those used by hypervisors
and databases can also leverage application hints to communicate and databases can also leverage application hints to communicate
their specialized requirements. their specialized requirements.
This section adds a new IO_ADVISE operation to communicate the client
file access patterns to the NFS server. The NFS server upon
receiving an IO_ADVISE operation MAY choose to alter its I/O and
caching behavior, but is under no obligation to do so.
3.2. POSIX Requirements
The first key requirement of the IO_ADVISE operation is to support
the posix_fadvise function [6], which is supported in Linux and many
other operating systems. Examples and guidance on how to use
posix_fadvise to improve performance can be found in [16].
posix_fadvise is defined as follows:
int posix_fadvise(int fd, off_t offset, off_t len, int advice);
The posix_fadvise() function shall advise the implementation on the
expected behavior of the application with respect to the data in the
file associated with the open file descriptor, fd, starting at offset
and continuing for len bytes. The specified range need not currently
exist in the file. If len is zero, all data following offset is
specified. The implementation may use this information to optimize
handling of the specified data. The posix_fadvise() function shall
have no effect on the semantics of other operations on the specified
data, although it may affect the performance of other operations.
The advice to be applied to the data is specified by the advice
parameter and may be one of the following values:
POSIX_FADV_NORMAL - Specifies that the application has no advice to
give on its behavior with respect to the specified data. It is
the default characteristic if no advice is given for an open file.
POSIX_FADV_SEQUENTIAL - Specifies that the application expects to
access the specified data sequentially from lower offsets to
higher offsets.
POSIX_FADV_RANDOM - Specifies that the application expects to access
the specified data in a random order.
POSIX_FADV_WILLNEED - Specifies that the application expects to
access the specified data in the near future.
POSIX_FADV_DONTNEED - Specifies that the application expects that it
will not access the specified data in the near future.
POSIX_FADV_NOREUSE - Specifies that the application expects to
access the specified data once and then not reuse it thereafter.
Upon successful completion, posix_fadvise() shall return zero;
otherwise, an error number shall be returned to indicate the error.
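As an illustration of the interface described above, the following is a minimal Python sketch; Python's os.posix_fadvise wraps the same POSIX call where the platform provides it. The helper name read_sequentially and the chunk size are hypothetical and not part of this document. Because the hint is advisory only, the sketch ignores a failed or unavailable hint and returns the same data either way.

```python
import os


def read_sequentially(path, chunk_size=65536):
    """Read a whole file after hinting that access will be sequential.

    Hypothetical helper for illustration; the hint never changes the data
    returned, only (possibly) how the OS prefetches it.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        # offset=0 with len=0 means "all data following offset".
        # The call is not available on every platform, so guard it.
        if hasattr(os, "posix_fadvise"):
            try:
                os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_SEQUENTIAL)
            except OSError:
                pass  # advice is optional; carry on without it
        chunks = []
        while True:
            chunk = os.read(fd, chunk_size)
            if not chunk:  # empty read signals end of file
                break
            chunks.append(chunk)
        return b"".join(chunks)
    finally:
        os.close(fd)
```

The other advice values (for example os.POSIX_FADV_NOREUSE) can be passed in the same position on platforms that define them.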
3.3. Additional Requirements
Many use cases exist for sending application I/O hints to the server
that cannot utilize the POSIX supported interface. This is because
some applications may benefit from additional hints not specified by
posix_fadvise, and some applications may not use POSIX altogether.
One use case is "Opportunistic Prefetch", which allows a stateid
holder to tell the server that it is possible that it will access the
specified data in the near future. This is similar to
POSIX_FADV_WILLNEED, but the client is unsure it will in fact read
the specified data, so the server should only prefetch the data if it
can be done at a marginal cost. For example, when a server receives
this hint, it could prefetch only the indirect blocks for a file
instead of all the data. This would still improve performance if the
client does read the data, but with less pressure on server memory.
An example use case for this hint is a database that reads in a
single record that points to additional records in either other areas
of the same file or different files located on the same or different
server. While it is likely that the application may access the
additional records, it is far from guaranteed. Therefore, the
database may issue an opportunistic prefetch (instead of
POSIX_FADV_WILLNEED) for the data in the other files pointed to by
the record.
Another use case is "Direct I/O", which allows a stateid holder to
inform the server that it does not wish to cache data. Today, for
applications that only intend to read data once, the use of direct
I/O disables client caching, but does not affect server caching. By
caching data that will not be re-read, the server is polluting its
cache and possibly causing useful cached data to be evicted. By
informing the server of its expected I/O access, this situation can
be avoided. Direct I/O can be used in Linux and AIX via the open()
O_DIRECT parameter, in Solaris via the directio() function, and in
Windows via the CreateFile() FILE_FLAG_NO_BUFFERING flag.
Another use case is "Backward Sequential Read", which allows a stateid
holder to inform the server that it intends to read the specified
data backwards, i.e., from the end to the beginning. This is
different from POSIX_FADV_SEQUENTIAL, whose implied intention was
that data will be read from beginning to end. This hint allows
servers to prefetch data at the end of the range first, and then
prefetch data sequentially in a backwards manner to the start of the
data range. One example of an application that can make use of this
hint is video editing.
3.4. Security Considerations
None.
3.5. IANA Considerations
The IO_ADVISE_type4 will be extended through an IANA registry.
4. Sparse Files 4. Sparse Files
4.1. Introduction 4.1. Introduction
A sparse file is a common way of representing a large file without A sparse file is a common way of representing a large file without
having to utilize all of the disk space for it. Consequently, a having to utilize all of the disk space for it. Consequently, a
sparse file uses less physical space than its size indicates. This sparse file uses less physical space than its size indicates. This
means the file contains 'holes', byte ranges within the file that means the file contains 'holes', byte ranges within the file that
contain no data. Most modern file systems support sparse files, contain no data. Most modern file systems support sparse files,
including most UNIX file systems and NTFS, but notably not Apple's including most UNIX file systems and NTFS, but notably not Apple's
skipping to change at page 31, line 37 skipping to change at page 28, line 11
The addition of this problem doesn't solve the problem of space being The addition of this problem doesn't solve the problem of space being
over-reported. However, over-reporting is better than under- over-reported. However, over-reporting is better than under-
reporting. reporting.
6. Application Data Block Support 6. Application Data Block Support
At the OS level, files are contained on disk blocks. Applications At the OS level, files are contained on disk blocks. Applications
are also free to impose structure on the data contained in a file and are also free to impose structure on the data contained in a file and
we can define an Application Data Block (ADB) to be such a structure. we can define an Application Data Block (ADB) to be such a structure.
From the application's viewpoint, it only wants to handle ADBs and From the application's viewpoint, it only wants to handle ADBs and
not raw bytes (see [17]). An ADB is typically comprised of two not raw bytes (see [16]). An ADB is typically comprised of two
sections: a header and data. The header describes the sections: a header and data. The header describes the
characteristics of the block and can provide a means to detect characteristics of the block and can provide a means to detect
corruption in the data payload. The data section is typically corruption in the data payload. The data section is typically
initialized to all zeros. initialized to all zeros.
The format of the header is application specific, but there are two The format of the header is application specific, but there are two
main components typically encountered: main components typically encountered:
1. An ADB Number (ADBN), which allows the application to determine 1. An ADB Number (ADBN), which allows the application to determine
which data block is being referenced. The ADBN is a logical which data block is being referenced. The ADBN is a logical
block number and is useful when the client is not storing the block number and is useful when the client is not storing the
blocks in contiguous memory. blocks in contiguous memory.
2. Fields to describe the state of the ADB and a means to detect 2. Fields to describe the state of the ADB and a means to detect
block corruption. For both pieces of data, a useful property is block corruption. For both pieces of data, a useful property is
that allowed values be unique in that if passed across the that allowed values be unique in that if passed across the
network, corruption due to translation between big and little network, corruption due to translation between big and little
endian architectures is detectable. For example, 0xF0DEDEF0 has endian architectures is detectable. For example, 0xF0DEDEF0 has
the same bit pattern in both architectures. the same bit pattern in both architectures.
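The byte-order property mentioned in item 2 can be checked with a short Python sketch. The function names byteswap32 and is_endian_invariant are hypothetical and chosen only for illustration. A 32-bit value survives a big/little endian swap unchanged exactly when its four bytes read the same forwards and backwards.

```python
def byteswap32(value):
    """Return the 32-bit unsigned value with its byte order reversed."""
    return int.from_bytes(value.to_bytes(4, "big"), "little")


def is_endian_invariant(value):
    """True if the value is identical on big and little endian hosts."""
    return byteswap32(value) == value
```

With these helpers, is_endian_invariant(0xF0DEDEF0) holds because the bytes F0 DE DE F0 are symmetric, whereas a value such as 0xDEADBEEF becomes 0xEFBEADDE when swapped.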
Applications already impose structures on files [17] and detect Applications already impose structures on files [16] and detect
corruption in data blocks [18]. What they are not able to do is corruption in data blocks [17]. What they are not able to do is
efficiently transfer and store ADBs. To initialize a file with ADBs, efficiently transfer and store ADBs. To initialize a file with ADBs,
the client must send the full ADB to the server and that must be the client must send the full ADB to the server and that must be
stored on the server. When the application is initializing a file to stored on the server. When the application is initializing a file to
have the ADB structure, it could compress the ADBs to just the have the ADB structure, it could compress the ADBs to just the
information necessary to later reconstruct the header portion of information necessary to later reconstruct the header portion of
the ADB when the contents are read back. Using sparse file the ADB when the contents are read back. Using sparse file
techniques, the disk blocks described by would not be allocated. techniques, the disk blocks described by would not be allocated.
Unlike sparse file techniques, there would be a small cost to store Unlike sparse file techniques, there would be a small cost to store
the compressed header data. the compressed header data.
skipping to change at page 32, line 39 skipping to change at page 29, line 14
6.1. Generic Framework 6.1. Generic Framework
We want the representation of the ADB to be flexible enough to We want the representation of the ADB to be flexible enough to
support many different applications. The most basic approach is no support many different applications. The most basic approach is no
imposition of a block at all, which means we are working with the raw imposition of a block at all, which means we are working with the raw
bytes. Such an approach would be useful for storing holes, punching bytes. Such an approach would be useful for storing holes, punching
holes, etc. In more complex deployments, a server might be holes, etc. In more complex deployments, a server might be
supporting multiple applications, each with their own definition of supporting multiple applications, each with their own definition of
the ADB. One might store the ADBN at the start of the block and then the ADB. One might store the ADBN at the start of the block and then
have a guard pattern to detect corruption [19]. The next might store have a guard pattern to detect corruption [18]. The next might store
the ADBN at an offset of 100 bytes within the block and have no guard the ADBN at an offset of 100 bytes within the block and have no guard
pattern at all. I.e., existing applications might already have well pattern at all. I.e., existing applications might already have well
defined formats for their data blocks. defined formats for their data blocks.
The guard pattern can be used to represent the state of the block, to The guard pattern can be used to represent the state of the block, to
protect against corruption, or both. Again, it needs to be able to protect against corruption, or both. Again, it needs to be able to
be placed anywhere within the ADB. be placed anywhere within the ADB.
We need to be able to represent the starting offset of the block and We need to be able to represent the starting offset of the block and
the size of the block. Note that nothing prevents the application the size of the block. Note that nothing prevents the application
skipping to change at page 34, line 43 skipping to change at page 31, line 20
0xcafedead - This is the DATA state and indicates that real data 0xcafedead - This is the DATA state and indicates that real data
has been written to this block. has been written to this block.
0xe4e5c001 - This is the INDIRECT state and indicates that the 0xe4e5c001 - This is the INDIRECT state and indicates that the
block contains block counter numbers that are chained off of this block contains block counter numbers that are chained off of this
block. block.
0xba1ed4a3 - This is the INVALID state and indicates that the block 0xba1ed4a3 - This is the INVALID state and indicates that the block
contains data whose contents are garbage. contains data whose contents are garbage.
Finally, it also defines an 8 byte checksum [20] starting at byte 16 Finally, it also defines an 8 byte checksum [19] starting at byte 16
which applies to the remaining contents of the block. If the state which applies to the remaining contents of the block. If the state
is FREE, then that checksum is trivially zero. As such, the is FREE, then that checksum is trivially zero. As such, the
application has no need to transfer the checksum implicitly inside application has no need to transfer the checksum implicitly inside
the ADB - it need not make the transfer layer aware of the fact that the ADB - it need not make the transfer layer aware of the fact that
there is a checksum (see [18] for an example of checksums used to there is a checksum (see [17] for an example of checksums used to
detect corruption in application data blocks). detect corruption in application data blocks).
Corruption in each ADB can be detected thusly: Corruption in each ADB can be detected thusly:
o If the guard pattern is anything other than one of the allowed o If the guard pattern is anything other than one of the allowed
values, including all zeros. values, including all zeros.
o If the guard pattern is FREE and any other byte in the remainder o If the guard pattern is FREE and any other byte in the remainder
of the ADB is anything other than zero. of the ADB is anything other than zero.
skipping to change at page 38, line 7 skipping to change at page 34, line 37
access to an object. access to an object.
MAC-Aware: is a server which can transmit and store object labels. MAC-Aware: is a server which can transmit and store object labels.
MAC-Functional: is a client or server which is Labeled NFS enabled. MAC-Functional: is a client or server which is Labeled NFS enabled.
Such a system can interpret labels and apply policies based on the Such a system can interpret labels and apply policies based on the
security system. security system.
Multi-Level Security (MLS): is a traditional model where objects are Multi-Level Security (MLS): is a traditional model where objects are
given a sensitivity level (Unclassified, Secret, Top Secret, etc) given a sensitivity level (Unclassified, Secret, Top Secret, etc)
and a category set [21]. and a category set [20].
7.3. MAC Security Attribute 7.3. MAC Security Attribute
MAC models base access decisions on security attributes bound to MAC models base access decisions on security attributes bound to
subjects and objects. This information can range from a user subjects and objects. This information can range from a user
identity for an identity based MAC model, sensitivity levels for identity for an identity based MAC model, sensitivity levels for
Multi-level security, or a type for Type Enforcement. These models Multi-level security, or a type for Type Enforcement. These models
base their decisions on different criteria but the semantics of the base their decisions on different criteria but the semantics of the
security attribute remain the same. The semantics required by the security attribute remain the same. The semantics required by the
security attributes are listed below: security attributes are listed below:
skipping to change at page 38, line 34 skipping to change at page 35, line 15
o MUST provide the ability to enforce access control decisions both o MUST provide the ability to enforce access control decisions both
on the client and the server. on the client and the server.
o MUST NOT expose an object to either the client or server name o MUST NOT expose an object to either the client or server name
space before its security information has been bound to it. space before its security information has been bound to it.
NFSv4 implements the security attribute as a recommended attribute. NFSv4 implements the security attribute as a recommended attribute.
These attributes have a fixed format and semantics, which conflicts These attributes have a fixed format and semantics, which conflicts
with the flexible nature of the security attribute. To resolve this with the flexible nature of the security attribute. To resolve this
the security attribute consists of two components. The first the security attribute consists of two components. The first
component is an LFS as defined in [22] to allow for interoperability component is an LFS as defined in [21] to allow for interoperability
between MAC mechanisms. The second component is an opaque field between MAC mechanisms. The second component is an opaque field
which is the actual security attribute data. To allow for various which is the actual security attribute data. To allow for various
MAC models, NFSv4 should be used solely as a transport mechanism for MAC models, NFSv4 should be used solely as a transport mechanism for
the security attribute. It is the responsibility of the endpoints to the security attribute. It is the responsibility of the endpoints to
consume the security attribute and make access decisions based on consume the security attribute and make access decisions based on
their respective models. In addition, creation of objects through their respective models. In addition, creation of objects through
OPEN and CREATE allows for the security attribute to be specified OPEN and CREATE allows for the security attribute to be specified
upon creation. By providing an atomic create and set operation for upon creation. By providing an atomic create and set operation for
the security attribute it is possible to enforce the second and the security attribute it is possible to enforce the second and
fourth requirements. The recommended attribute FATTR4_SEC_LABEL (see fourth requirements. The recommended attribute FATTR4_SEC_LABEL (see
skipping to change at page 46, line 17 skipping to change at page 42, line 46
11.1. New RECOMMENDED Attributes - List and Definition References 11.1. New RECOMMENDED Attributes - List and Definition References
The list of new RECOMMENDED attributes appears in Table 2. The The list of new RECOMMENDED attributes appears in Table 2. The
meanings of the columns of the table are: meanings of the columns of the table are:
Name: The name of the attribute. Name: The name of the attribute.
Id: The number assigned to the attribute. In the event of conflicts Id: The number assigned to the attribute. In the event of conflicts
between the assigned number and [3], the latter is likely between the assigned number and [3], the latter is likely
authoritative, but should be resolved with Errata to this document authoritative, but should be resolved with Errata to this document
and/or [3]. See [23] for the Errata process. and/or [3]. See [22] for the Errata process.
Data Type: The XDR data type of the attribute. Data Type: The XDR data type of the attribute.
Acc: Access allowed to the attribute. Acc: Access allowed to the attribute.
R means read-only (GETATTR may retrieve, SETATTR may not set). R means read-only (GETATTR may retrieve, SETATTR may not set).
W means write-only (SETATTR may set, GETATTR may not retrieve). W means write-only (SETATTR may set, GETATTR may not retrieve).
R W means read/write (GETATTR may retrieve, SETATTR may set). R W means read/write (GETATTR may retrieve, SETATTR may set).
skipping to change at page 48, line 42 skipping to change at page 45, line 20
labelformat_spec4 slai_lfs; labelformat_spec4 slai_lfs;
opaque slai_data<>; opaque slai_data<>;
}; };
The FATTR4_SEC_LABEL contains an array of two components with the The FATTR4_SEC_LABEL contains an array of two components with the
first component being an LFS. It serves to provide the receiving end first component being an LFS. It serves to provide the receiving end
with the information necessary to translate the security attribute with the information necessary to translate the security attribute
into a form that is usable by the endpoint. Label Formats assigned into a form that is usable by the endpoint. Label Formats assigned
an LFS may optionally choose to include a Policy Identifier field to an LFS may optionally choose to include a Policy Identifier field to
allow for complex policy deployments. The LFS and Label Format allow for complex policy deployments. The LFS and Label Format
Registry are described in detail in [22]. The translation used to Registry are described in detail in [21]. The translation used to
interpret the security attribute is not specified as part of the interpret the security attribute is not specified as part of the
protocol as it may depend on various factors. The second component protocol as it may depend on various factors. The second component
is an opaque section which contains the data of the attribute. This is an opaque section which contains the data of the attribute. This
component is dependent on the MAC model to interpret and enforce. component is dependent on the MAC model to interpret and enforce.
In particular, it is the responsibility of the LFS specification to In particular, it is the responsibility of the LFS specification to
define a maximum size for the opaque section, slai_data<>. When define a maximum size for the opaque section, slai_data<>. When
creating or modifying a label for an object, the client needs to be creating or modifying a label for an object, the client needs to be
guaranteed that the server will accept a label that is sized guaranteed that the server will accept a label that is sized
correctly. By both client and server being part of a specific MAC correctly. By both client and server being part of a specific MAC
skipping to change at page 70, line 16 skipping to change at page 67, line 4
This document does not mandate the manner in which the server stores This document does not mandate the manner in which the server stores
ADBs sparsely for a file. It does assume that if ADBs are stored ADBs sparsely for a file. It does assume that if ADBs are stored
sparsely, then the server can detect when an INITIALIZE arrives that sparsely, then the server can detect when an INITIALIZE arrives that
will force a new ADB to start inside an existing ADB. For example, will force a new ADB to start inside an existing ADB. For example,
assume that ADBi has an adb_block_size of 4k and that an INITIALIZE assume that ADBi has an adb_block_size of 4k and that an INITIALIZE
starts 1k inside ADBi. The server should [[Comment.3: Need to flesh starts 1k inside ADBi. The server should [[Comment.3: Need to flesh
this out. --TH]] this out. --TH]]
13.8. Operation 67: IO_ADVISE - Application I/O access pattern hints 13.8. Operation 67: IO_ADVISE - Application I/O access pattern hints
This section introduces a new operation, named IO_ADVISE, which
allows NFS clients to communicate application I/O access pattern
hints to the NFS server. This new operation will allow hints to be
sent to the server when applications use posix_fadvise, direct I/O,
or at any other point that the client finds useful.
13.8.1. ARGUMENT 13.8.1. ARGUMENT
enum IO_ADVISE_type4 { enum IO_ADVISE_type4 {
IO_ADVISE4_NORMAL = 0, IO_ADVISE4_NORMAL = 0,
IO_ADVISE4_SEQUENTIAL = 1, IO_ADVISE4_SEQUENTIAL = 1,
IO_ADVISE4_SEQUENTIAL_BACKWARDS = 2, IO_ADVISE4_SEQUENTIAL_BACKWARDS = 2,
IO_ADVISE4_RANDOM = 3, IO_ADVISE4_RANDOM = 3,
IO_ADVISE4_WILLNEED = 4, IO_ADVISE4_WILLNEED = 4,
IO_ADVISE4_WILLNEED_OPPORTUNISTIC = 5, IO_ADVISE4_WILLNEED_OPPORTUNISTIC = 5,
IO_ADVISE4_DONTNEED = 6, IO_ADVISE4_DONTNEED = 6,
skipping to change at page 71, line 21 skipping to change at page 67, line 44
union IO_ADVISE4res switch (nfsstat4 _status) { union IO_ADVISE4res switch (nfsstat4 _status) {
case NFS4_OK: case NFS4_OK:
IO_ADVISE4resok resok4; IO_ADVISE4resok resok4;
default: default:
void; void;
}; };
13.8.3. DESCRIPTION 13.8.3. DESCRIPTION
The IO_ADVISE operation sends an I/O access pattern hint to the The IO_ADVISE operation sends an I/O access pattern hint to the
server for the owner of stated for a given byte range specified by server for the owner of the stateid for a given byte range specified
iar_offset and iar_count. The byte range specified by iar_offset and by iar_offset and iar_count. The byte range specified by iar_offset
iar_count need not currently exist in the file, but the iar_hints and iar_count need not currently exist in the file, but the iar_hints
will apply to the byte range when it does exist. If iar_count is 0, will apply to the byte range when it does exist. If iar_count is 0,
all data following iar_offset is specified. The server MAY ignore all data following iar_offset is specified. The server MAY ignore
the advice. the advice.
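The byte-range rule above, including the special meaning of an iar_count of 0, can be sketched as follows. This is an illustration only; the function names advised_range and covers are hypothetical and do not appear in the protocol.

```python
from typing import Optional, Tuple


def advised_range(iar_offset: int, iar_count: int) -> Tuple[int, Optional[int]]:
    """Return (start, end) for a hint, with end exclusive.

    An end of None means "all data following iar_offset", which is
    how the text defines an iar_count of 0.
    """
    if iar_offset < 0 or iar_count < 0:
        raise ValueError("offset and count must be non-negative")
    if iar_count == 0:
        return (iar_offset, None)
    return (iar_offset, iar_offset + iar_count)


def covers(iar_offset: int, iar_count: int, position: int) -> bool:
    """Return True if the hint applies to the byte at 'position'.

    The byte need not exist in the file yet; the hint applies once
    the range does exist.
    """
    start, end = advised_range(iar_offset, iar_count)
    return position >= start and (end is None or position < end)
```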
The following are the possible hints: The following are the allowed hints for a stateid holder:
IO_ADVISE4_NORMAL Specifies that the application has no advice to IO_ADVISE4_NORMAL There is no advice to give, this is the default
give on its behavior with respect to the specified data. It is behavior.
the default characteristic if no advice is given.
IO_ADVISE4_SEQUENTIAL Specifies that the stated holder expects to IO_ADVISE4_SEQUENTIAL Expects to access the specified data
access the specified data sequentially from lower offsets to sequentially from lower offsets to higher offsets.
higher offsets.
IO_ADVISE4_SEQUENTIAL_BACKWARDS Specifies that the stated holder IO_ADVISE4_SEQUENTIAL_BACKWARDS Expects to access the specified data
expects to access the specified data sequentially from higher sequentially from higher offsets to lower offsets.
offsets to lower offsets.
IO_ADVISE4_RANDOM Specifies that the stated holder expects to access IO_ADVISE4_RANDOM Expects to access the specified data in a random
the specified data in a random order. order.
IO_ADVISE4_WILLNEED Specifies that the stated holder expects to IO_ADVISE4_WILLNEED Expects to access the specified data in the near
access the specified data in the near future. future.
IO_ADVISE4_WILLNEED_OPPORTUNISTIC Specifies that the stated holder IO_ADVISE4_WILLNEED_OPPORTUNISTIC Expects to possibly access the
expects to possibly access the data in the near future. This is a data in the near future. This is a speculative hint, and
speculative hint, and therefore the server should prefetch data or therefore the server should prefetch data or indirect blocks only
indirect blocks only if it can be done at a marginal cost. if it can be done at a marginal cost.
IO_ADVISE4_DONTNEED Specifies that the stated holder expects that it IO_ADVISE4_DONTNEED Expects that it will not access the specified
will not access the specified data in the near future. data in the near future.
IO_ADVISE4_NOREUSE Specifies that the stated holder expects to access IO_ADVISE4_NOREUSE Expects to access the specified data once and then
the specified data once and then not reuse it thereafter. not reuse it thereafter.
IO_ADVISE4_READ Specifies that the stated holder expects to read the IO_ADVISE4_READ Expects to read the specified data in the near
specified data in the near future. future.
IO_ADVISE4_WRITE Specifies that the stated holder expects to write IO_ADVISE4_WRITE Expects to write the specified data in the near
the specified data in the near future. future.
IO_ADVISE4_INIT_PROXIMITY The client has recently accessed the byte IO_ADVISE4_INIT_PROXIMITY Informs the server that the data in the
range in its own cache. This informs the server that the data in byte range remains important to the client.
the byte range remains important to the client. When the server
reaches resource exhaustion, knowing which data is more important
allows the server to make better choices about which data to, for
example purge from a cache, or move to secondary storage. It also
informs the server which delegations are more important, since if
delegations are working correctly, once delegated to a client, a
server might never receive another I/O request for the file.
The server will return success if the operation is properly formed, Since IO_ADVISE is a hint, a server SHOULD NOT return an error and
otherwise the server will return an error. The server MUST NOT invalidate an entire Compound request if one of the sent hints in
return an error if it does not recognize or does not support the iar_hints is not supported by the server. Also, the server MUST NOT
requested advice. This is also true even if the client sends return an error if the client sends contradictory hints to the
contradictory hints to the server, e.g., IO_ADVISE4_SEQUENTIAL and server, e.g., IO_ADVISE4_SEQUENTIAL and IO_ADVISE4_RANDOM in a single
IO_ADVISE4_RANDOM in a single IO_ADVISE operation. In this case, the IO_ADVISE operation. In these cases, the server MUST return success
server MUST return success and an ior_hints value that indicates the and an ior_hints value that indicates the hint it intends to
hint it intends to optimize. For contradictory hints, this may mean implement. This may mean simply returning IO_ADVISE4_NORMAL.
simply returning IO_ADVISE4_NORMAL for example.
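One way a server might compute the returned ior_hints under these rules is sketched below. This is not a normative algorithm. The function resolve_hints, the modeling of hints as a Python set, and the assumption that SEQUENTIAL, SEQUENTIAL_BACKWARDS, and RANDOM are mutually contradictory are all illustrative choices; only the enum values shown in the ARGUMENT section are used.

```python
from enum import IntEnum


class IoAdvise(IntEnum):
    # Values as listed in the IO_ADVISE_type4 enum above.
    NORMAL = 0
    SEQUENTIAL = 1
    SEQUENTIAL_BACKWARDS = 2
    RANDOM = 3
    WILLNEED = 4
    WILLNEED_OPPORTUNISTIC = 5
    DONTNEED = 6


# Assumption: these access patterns contradict one another.
ACCESS_PATTERNS = frozenset(
    {IoAdvise.SEQUENTIAL, IoAdvise.SEQUENTIAL_BACKWARDS, IoAdvise.RANDOM}
)


def resolve_hints(iar_hints, supported):
    """Return the hints the server intends to implement.

    Unsupported hints are dropped and contradictory hints collapse to
    NORMAL; neither case produces an error.
    """
    accepted = {h for h in iar_hints if h in supported}
    if len(accepted & ACCESS_PATTERNS) > 1:
        return {IoAdvise.NORMAL}
    return accepted or {IoAdvise.NORMAL}
```

Collapsing to IO_ADVISE4_NORMAL on a contradiction is only one permissible outcome; a server could equally pick one of the conflicting hints and report that instead.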
The ior_hints returned by the server is primarily for debugging The ior_hints returned by the server is primarily for debugging
purposes since the server is under no obligation to carry out the purposes since the server is under no obligation to carry out the
hints that it describes in the ior_hints result. In addition, while hints that it describes in the ior_hints result. In addition, while
the server may have intended to implement the hints returned in the server may have intended to implement the hints returned in
ior_hints, as time progresses, the server may need to change its ior_hints, as time progresses, the server may need to change its
handling of a given file due to several reasons including, but not handling of a given file due to several reasons including, but not
limited to, memory pressure, additional IO_ADVISE hints sent by other limited to, memory pressure, additional IO_ADVISE hints sent by other
clients, and heuristically detected file access patterns. clients, and heuristically detected file access patterns.
The server MAY return different advice than what the client The server MAY return different advice than what the client
requested. If it does, then this might be due to one of several requested. If it does, then this might be due to one of several
conditions, including, but not limited to another client advising of conditions, including, but not limited to another client advising of
a different I/O access pattern; a different I/O access pattern from a different I/O access pattern; a different I/O access pattern from
another client that the server has heuristically detected; or another client that the server has heuristically detected; or
the server is not able to support the requested I/O access pattern, the server is not able to support the requested I/O access pattern,
perhaps due to a temporary resource limitation. perhaps due to a temporary resource limitation.
Each issuance of the IO_ADVISE operation overrides all previous Each issuance of the IO_ADVISE operation overrides all previous
issuances of IO_ADVISE for a given byte range. This effectively issuances of IO_ADVISE for a given byte range. This effectively
follows a strategy of last hint wins for a given stated and byte follows a strategy of last hint wins for a given stateid and byte
range. range.
Clients should assume that hints included in an IO_ADVISE operation Clients should assume that hints included in an IO_ADVISE operation
will be forgotten once the file is closed. will be forgotten once the file is closed.
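The last-hint-wins behavior and the discarding of hints at close can be sketched with a small table keyed by stateid and byte range. This is a simplification under stated assumptions: the class HintTable is hypothetical, stateids are modeled as opaque bytes, and only exact byte-range matches are overridden, whereas a real server would also need a policy for partially overlapping ranges.

```python
class HintTable:
    """Hypothetical per-server record of IO_ADVISE hints."""

    def __init__(self):
        self._hints = {}

    def advise(self, stateid: bytes, offset: int, count: int, hints) -> None:
        # A later IO_ADVISE for the same stateid and range replaces
        # whatever was recorded before (last hint wins).
        self._hints[(stateid, offset, count)] = frozenset(hints)

    def lookup(self, stateid: bytes, offset: int, count: int) -> frozenset:
        return self._hints.get((stateid, offset, count), frozenset())

    def close(self, stateid: bytes) -> None:
        # Hints are forgotten once the file is closed.
        for key in [k for k in self._hints if k[0] == stateid]:
            del self._hints[key]


table = HintTable()
table.advise(b"sid-1", 0, 4096, {"IO_ADVISE4_SEQUENTIAL"})
table.advise(b"sid-1", 0, 4096, {"IO_ADVISE4_RANDOM"})
print(sorted(table.lookup(b"sid-1", 0, 4096)))
```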
13.8.4. IMPLEMENTATION 13.8.4. IMPLEMENTATION
The NFS client may choose to issue an IO_ADVISE operation to the The NFS client may choose to issue an IO_ADVISE operation to the
server in several different instances. server in several different instances.
The most obvious is in direct response to an application's execution The most obvious is in direct response to an application's execution
of posix_fadvise. In this case, IO_ADVISE4_WRITE and IO_ADVISE4_READ of posix_fadvise(). In this case, IO_ADVISE4_WRITE and
may be set based upon the type of file access specified when the file
was opened.
Another useful point would be when an application indicates it is
using direct I/O. Direct I/O may be specified at file open, in which
case an IO_ADVISE may be included in the same compound as the OPEN
operation with the IO_ADVISE4_NOREUSE flag set. Direct I/O may also
be specified separately, in which case an IO_ADVISE operation can be
sent to the server separately. As above, IO_ADVISE4_WRITE and
IO_ADVISE4_READ may be set based upon the type of file access IO_ADVISE4_READ may be set based upon the type of file access
specified when the file was opened. specified when the file was opened.
13.8.5. pNFS File Layout Data Type Considerations 13.8.5. IO_ADVISE4_INIT_PROXIMITY
The IO_ADVISE4_INIT_PROXIMITY hint is non-POSIX in origin and conveys
that the client has recently accessed the byte range in its own
cache. I.e., it has not accessed it on the server, but it has done so
locally. When the server reaches resource exhaustion, knowing which
data is more important allows the server to make better choices about
which data to, for example, purge from a cache or move to secondary
storage. It also informs the server which delegations are more
important, since if delegations are working correctly, once delegated
to a client and the client has read the content for that byte range,
a server might never receive another read request for that byte
range.
This hint is also useful in the case of NFS clients which are network
booting from a server. If the first client to be booted sends this
hint, then it keeps the cache warm for the remaining clients.
13.8.6. pNFS File Layout Data Type Considerations
The IO_ADVISE considerations for pNFS are very similar to the COMMIT The IO_ADVISE considerations for pNFS are very similar to the COMMIT
considerations for pNFS. That is, as with COMMIT, some NFS server considerations for pNFS. That is, as with COMMIT, some NFS server
implementations prefer IO_ADVISE be done on the DS, and some prefer implementations prefer IO_ADVISE be done on the DS, and some prefer
it be done on the MDS. it be done on the MDS.
So for the file's layout type, it is proposed that NFSv4.2 include an So for the file's layout type, it is proposed that NFSv4.2 include an
additional hint NFL42_CARE_IO_ADVISE_THRU_MDS which is valid only on additional hint NFL42_CARE_IO_ADVISE_THRU_MDS which is valid only on
NFSv4.2 or higher. Any file's layout obtained with NFSv4.1 MUST NOT NFSv4.2 or higher. Any file's layout obtained with NFSv4.1 MUST NOT
have NFL42_UFLG_IO_ADVISE_THRU_MDS set. Any file's layout obtained have NFL42_UFLG_IO_ADVISE_THRU_MDS set. Any file's layout obtained
with NFSv4.2 MAY have NFL42_UFLG_IO_ADVISE_THRU_MDS set. If the with NFSv4.2 MAY have NFL42_UFLG_IO_ADVISE_THRU_MDS set. If the
client does not implement IO_ADVISE, then it MUST ignore client does not implement IO_ADVISE, then it MUST ignore
NFL42_UFLG_IO_ADVISE_THRU_MDS. NFL42_UFLG_IO_ADVISE_THRU_MDS.
If NFL42_UFLG_IO_ADVISE_THRU_MDS is set, then if the client If NFL42_UFLG_IO_ADVISE_THRU_MDS is set, the client MUST send the
implements IO_ADVISE, then if it wants the DS to honor IO_ADVISE, the IO_ADVISE operation to the MDS in order for it to be honored by the
client MUST send the operation to the MDS, and the server will DS. Once the MDS receives the IO_ADVISE operation, it will
communicate the advice back each DS. If the client sends IO_ADVISE communicate the advice to each DS.
to the DS, then the server MAY return NFS4ERR_NOTSUPP.
If NFL42_UFLG_IO_ADVISE_THRU_MDS is not set, then this indicates to If NFL42_UFLG_IO_ADVISE_THRU_MDS is not set, then the client SHOULD
client that if wants to inform the server via IO_ADVISE of the send an IO_ADVISE operation to the appropriate DS for the specified
client's intended use of the file, then the client SHOULD send an byte range. While the client MAY always send IO_ADVISE to the MDS,
IO_ADVISE to each DS. While the client MAY always send IO_ADVISE to if the server has not set NFL42_UFLG_IO_ADVISE_THRU_MDS, the client
the MDS, if the server has not set NFL42_UFLG_IO_ADVISE_THRU_MDS, the should expect that such an IO_ADVISE is futile. Note that a client
client should expect that such an IO_ADVISE is futile. Note that a SHOULD use the same set of arguments on each IO_ADVISE sent to a DS
client SHOULD use the same set of arguments on each IO_ADVISE sent to for the same open file reference.
a DS for the same open file reference.
The server is not required to support different advice for different The server is not required to support different advice for different
DS's with the same open file reference. DS's with the same open file reference.
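The client-side routing decision described above can be sketched as follows. This is an illustration only: the numeric value given to NFL42_UFLG_IO_ADVISE_THRU_MDS is a placeholder assumption, since this section does not define the flag's bit position, and the names Target and io_advise_target are hypothetical.

```python
from enum import Enum

# Placeholder bit value, assumed for illustration only.
NFL42_UFLG_IO_ADVISE_THRU_MDS = 0x00000004


class Target(Enum):
    MDS = "metadata server"
    DS = "data server"


def io_advise_target(layout_flags: int) -> Target:
    """Pick where a client sends IO_ADVISE for a file layout.

    If the flag is set, the operation goes to the MDS, which relays
    the advice to each DS. Otherwise the client sends it to the
    appropriate DS for the byte range.
    """
    if layout_flags & NFL42_UFLG_IO_ADVISE_THRU_MDS:
        return Target.MDS
    return Target.DS
```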
13.8.5.1. Dense and Sparse Packing Considerations 13.8.6.1. Dense and Sparse Packing Considerations
The IO_ADVISE operation MUST use the iar_offset and byte range as The IO_ADVISE operation MUST use the iar_offset and byte range as
dictated by the presence or absence of NFL4_UFLG_DENSE. dictated by the presence or absence of NFL4_UFLG_DENSE.
E.g., if NFL4_UFLG_DENSE is present, and a READ or WRITE to the DS E.g., if NFL4_UFLG_DENSE is present, and a READ or WRITE to the DS
for iar_offset 0 really means iar_offset 10000 in the logical file, for iar_offset 0 really means iar_offset 10000 in the logical file,
then an IO_ADVISE for iar_offset 0 means iar_offset 10000. then an IO_ADVISE for iar_offset 0 means iar_offset 10000.
E.g., if NFL4_UFLG_DENSE is absent, then a READ or WRITE to the DS E.g., if NFL4_UFLG_DENSE is absent, then a READ or WRITE to the DS
for iar_offset 0 really means iar_offset 0 in the logical file, then for iar_offset 0 really means iar_offset 0 in the logical file, then
skipping to change at page 75, line 41 skipping to change at page 72, line 9
If neither of the flags NFL42_UFLG_IO_ADVISE_THRU_MDS and If neither of the flags NFL42_UFLG_IO_ADVISE_THRU_MDS and
NFL4_UFLG_DENSE is set in the layout, then any IO_ADVISE request NFL4_UFLG_DENSE is set in the layout, then any IO_ADVISE request
sent to the data server with a byte range that overlaps a stripe unit sent to the data server with a byte range that overlaps a stripe unit
that the data server does not serve MUST NOT result in the status that the data server does not serve MUST NOT result in the status
NFS4ERR_PNFS_IO_HOLE. Instead, the response SHOULD be successful and NFS4ERR_PNFS_IO_HOLE. Instead, the response SHOULD be successful and
if the server applies IO_ADVISE hints on any stripe units that if the server applies IO_ADVISE hints on any stripe units that
overlap with the specified range, those hints SHOULD be indicated in overlap with the specified range, those hints SHOULD be indicated in
the response. the response.
13.8.6. Number of Supported File Segments
In theory IO_ADVISE allows a client and server to support multiple
file segments, meaning that different, possibly overlapping, byte
ranges of the same open file reference will support different hints.
This is not practical, and in general the server will support just
one set of hints, and these will apply to the entire file. However,
there are some hints that are very ephemeral and essentially amount
to one-time instructions to the NFS server, which will be forgotten
momentarily after IO_ADVISE is executed.
The following hints will always apply to the entire file, regardless
of the specified byte range:
o IO_ADVISE4_NORMAL
o IO_ADVISE4_SEQUENTIAL
o IO_ADVISE4_SEQUENTIAL_BACKWARDS
o IO_ADVISE4_RANDOM
The following hints will always apply to the specified byte range and
will be treated as one-time instructions:
o IO_ADVISE4_WILLNEED
o IO_ADVISE4_WILLNEED_OPPORTUNISTIC
o IO_ADVISE4_DONTNEED
o IO_ADVISE4_NOREUSE
The following hints are modifiers to all other hints, and will apply
to the entire file and/or to a one time instruction on the specified
byte range:
o IO_ADVISE4_READ
o IO_ADVISE4_WRITE
13.9. Changes to Operation 51: LAYOUTRETURN 13.9. Changes to Operation 51: LAYOUTRETURN
13.9.1. Introduction 13.9.1. Introduction
In the pNFS description provided in [2], the client is not able to In the pNFS description provided in [2], the client is not able to
relay an error code from the DS to the MDS. In the specification of relay an error code from the DS to the MDS. In the specification of
the Objects-Based Layout protocol [9], use is made of the opaque the Objects-Based Layout protocol [9], use is made of the opaque
lrf_body field of the LAYOUTRETURN argument to do such a relaying of lrf_body field of the LAYOUTRETURN argument to do such a relaying of
error codes. In this section, we define a new data structure to error codes. In this section, we define a new data structure to
enable the passing of error codes back to the MDS and provide some enable the passing of error codes back to the MDS and provide some
skipping to change at page 77, line 32 skipping to change at page 73, line 8
for the MDS to consider such outages as being transitory. for the MDS to consider such outages as being transitory.
The existing LAYOUTRETURN operation is extended by introducing a new The existing LAYOUTRETURN operation is extended by introducing a new
data structure to report errors, layoutreturn_device_error4. Also, data structure to report errors, layoutreturn_device_error4. Also,
layoutreturn_error_report4 is introduced to enable an array of errors layoutreturn_error_report4 is introduced to enable an array of errors
to be reported. to be reported.
13.9.2. ARGUMENT 13.9.2. ARGUMENT
The ARGUMENT specification of the LAYOUTRETURN operation in section The ARGUMENT specification of the LAYOUTRETURN operation in section
18.44.1 of [2] is augmented by the following XDR code [24]: 18.44.1 of [2] is augmented by the following XDR code [23]:
struct layoutreturn_device_error4 { struct layoutreturn_device_error4 {
deviceid4 lrde_deviceid; deviceid4 lrde_deviceid;
nfsstat4 lrde_status; nfsstat4 lrde_status;
nfs_opnum4 lrde_opnum; nfs_opnum4 lrde_opnum;
}; };
struct layoutreturn_error_report4 { struct layoutreturn_error_report4 {
layoutreturn_device_error4 lrer_errors<>; layoutreturn_device_error4 lrer_errors<>;
}; };
skipping to change at page 88, line 10 skipping to change at page 83, line 13
of issue. of issue.
The CB_COPY operation may fail for the following reasons (this is a The CB_COPY operation may fail for the following reasons (this is a
partial list): partial list):
NFS4ERR_NOTSUPP: The copy offload operation is not supported by the NFS4ERR_NOTSUPP: The copy offload operation is not supported by the
NFS client receiving this request. NFS client receiving this request.
15. IANA Considerations 15. IANA Considerations
This section uses terms that are defined in [25]. This section uses terms that are defined in [24].
16. References 16. References
16.1. Normative References 16.1. Normative References
[1] Bradner, S., "Key words for use in RFCs to Indicate Requirement [1] Bradner, S., "Key words for use in RFCs to Indicate Requirement
Levels", March 1997. Levels", March 1997.
[2] Shepler, S., Eisler, M., and D. Noveck, "Network File System [2] Shepler, S., Eisler, M., and D. Noveck, "Network File System
(NFS) Version 4 Minor Version 1 Protocol", RFC 5661, (NFS) Version 4 Minor Version 1 Protocol", RFC 5661,
skipping to change at page 89, line 30 skipping to change at page 84, line 30
[13] Fielding, R., Gettys, J., Mogul, J., Frystyk, H., Masinter, L., [13] Fielding, R., Gettys, J., Mogul, J., Frystyk, H., Masinter, L.,
Leach, P., and T. Berners-Lee, "Hypertext Transfer Protocol -- Leach, P., and T. Berners-Lee, "Hypertext Transfer Protocol --
HTTP/1.1", RFC 2616, June 1999. HTTP/1.1", RFC 2616, June 1999.
[14] Postel, J. and J. Reynolds, "File Transfer Protocol", STD 9, [14] Postel, J. and J. Reynolds, "File Transfer Protocol", STD 9,
RFC 959, October 1985. RFC 959, October 1985.
[15] Simpson, W., "PPP Challenge Handshake Authentication Protocol [15] Simpson, W., "PPP Challenge Handshake Authentication Protocol
(CHAP)", RFC 1994, August 1996. (CHAP)", RFC 1994, August 1996.
[16] VanDeBogart, S., Frost, C., and E. Kohler, "Reducing Seek [16] Strohm, R., "Chapter 2, Data Blocks, Extents, and Segments, of
Overhead with Application-Directed Prefetching", Proceedings of
USENIX Annual Technical Conference , June 2009.
[17] Strohm, R., "Chapter 2, Data Blocks, Extents, and Segments, of
Oracle Database Concepts 11g Release 1 (11.1)", January 2011. Oracle Database Concepts 11g Release 1 (11.1)", January 2011.
[18] Ashdown, L., "Chapter 15, Validating Database Files and [17] Ashdown, L., "Chapter 15, Validating Database Files and
Backups, of Oracle Database Backup and Recovery User's Guide Backups, of Oracle Database Backup and Recovery User's Guide
11g Release 1 (11.1)", August 2008. 11g Release 1 (11.1)", August 2008.
[19] McDougall, R. and J. Mauro, "Section 11.4.3, Detecting Memory [18] McDougall, R. and J. Mauro, "Section 11.4.3, Detecting Memory
Corruption of Solaris Internals", 2007. Corruption of Solaris Internals", 2007.
[20] Bairavasundaram, L., Goodson, G., Schroeder, B., Arpaci- [19] Bairavasundaram, L., Goodson, G., Schroeder, B., Arpaci-
Dusseau, A., and R. Arpaci-Dusseau, "An Analysis of Data Dusseau, A., and R. Arpaci-Dusseau, "An Analysis of Data
Corruption in the Storage Stack", Proceedings of the 6th USENIX Corruption in the Storage Stack", Proceedings of the 6th USENIX
Symposium on File and Storage Technologies (FAST '08) , 2008. Symposium on File and Storage Technologies (FAST '08) , 2008.
[21] "Section 46.6. Multi-Level Security (MLS) of Deployment Guide: [20] "Section 46.6. Multi-Level Security (MLS) of Deployment Guide:
Deployment, configuration and administration of Red Hat Deployment, configuration and administration of Red Hat
Enterprise Linux 5, Edition 6", 2011. Enterprise Linux 5, Edition 6", 2011.
[22] Quigley, D. and J. Lu, "Registry Specification for MAC Security [21] Quigley, D. and J. Lu, "Registry Specification for MAC Security
Label Formats", draft-quigley-label-format-registry (work in Label Formats", draft-quigley-label-format-registry (work in
progress), 2011. progress), 2011.
[23] IESG, "IESG Processing of RFC Errata for the IETF Stream", [22] IESG, "IESG Processing of RFC Errata for the IETF Stream",
2008. 2008.
[24] Eisler, M., "XDR: External Data Representation Standard", [23] Eisler, M., "XDR: External Data Representation Standard",
RFC 4506, May 2006. RFC 4506, May 2006.
[25] Narten, T. and H. Alvestrand, "Guidelines for Writing an IANA [24] Narten, T. and H. Alvestrand, "Guidelines for Writing an IANA
Considerations Section in RFCs", BCP 26, RFC 5226, May 2008. Considerations Section in RFCs", BCP 26, RFC 5226, May 2008.
[25] VanDeBogart, S., Frost, C., and E. Kohler, "Reducing Seek
Overhead with Application-Directed Prefetching", Proceedings of
USENIX Annual Technical Conference , June 2009.
Appendix A. Acknowledgments Appendix A. Acknowledgments
For the pNFS Access Permissions Check, the original draft was by For the pNFS Access Permissions Check, the original draft was by
Sorin Faibish, David Black, Mike Eisler, and Jason Glasgow. The work Sorin Faibish, David Black, Mike Eisler, and Jason Glasgow. The work
was influenced by discussions with Benny Halevy and Bruce Fields. A was influenced by discussions with Benny Halevy and Bruce Fields. A
review was done by Tom Haynes. review was done by Tom Haynes.
For the Sharing change attribute implementation details with NFSv4 For the Sharing change attribute implementation details with NFSv4
clients, the original draft was by Trond Myklebust. clients, the original draft was by Trond Myklebust.