draft-ietf-nfsv4-minorversion2-02.txt

NFSv4                                                         T. Haynes
Internet-Draft                                                   Editor
Intended status: Standards Track                           May 09, 2011
Expires: November 10, 2011

NFS Version 4 Minor Version 2
draft-ietf-nfsv4-minorversion2-02.txt
Abstract

This Internet-Draft describes NFS version 4 minor version two,
focusing mainly on the protocol extensions made from NFS version 4
minor version 0 and NFS version 4 minor version 1.  Major extensions
introduced in NFS version 4 minor version two include: Server-side
Copy, Space Reservations, and Support for Sparse Files.

Requirements Language
skipping to change at page 1, line 40
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF).  Note that other groups may also distribute
working documents as Internet-Drafts.  The list of current Internet-
Drafts is at http://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time.  It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."

This Internet-Draft will expire on November 10, 2011.
Copyright Notice

Copyright (c) 2011 IETF Trust and the persons identified as the
document authors.  All rights reserved.

This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents
(http://trustee.ietf.org/license-info) in effect on the date of
publication of this document.  Please review these documents
skipping to change at page 3, line 48
4.3.4. Operation 59: COPY - Initiate a server-side copy . . . 23
4.3.5. Operation 60: COPY_ABORT - Cancel a server-side
       copy . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.3.6. Operation 63: COPY_STATUS - Poll for status of a
       server-side copy . . . . . . . . . . . . . . . . . . . 32
4.3.7. Operation 15: CB_COPY - Report results of a
       server-side copy . . . . . . . . . . . . . . . . . . . 33
4.3.8. Copy Offload Stateids . . . . . . . . . . . . . . . . 35
4.4. Security Considerations . . . . . . . . . . . . . . . . . 35
4.4.1. Inter-Server Copy Security . . . . . . . . . . . . . . 35
5. Application Data Block Support . . . . . . . . . . . . . . . . 43
5.1. Generic Framework . . . . . . . . . . . . . . . . . . . . 44
5.1.1. Data Block Representation . . . . . . . . . . . . . . 45
5.1.2. Data Content . . . . . . . . . . . . . . . . . . . . . 45
5.2. Operation 64: INITIALIZE . . . . . . . . . . . . . . . . . 45
5.2.1. ARGUMENT . . . . . . . . . . . . . . . . . . . . . . . 46
5.2.2. RESULT . . . . . . . . . . . . . . . . . . . . . . . . 46
5.2.3. DESCRIPTION . . . . . . . . . . . . . . . . . . . . . 47
5.3. Operation 65: READ_PLUS . . . . . . . . . . . . . . . . . 48
5.3.1. ARGUMENT . . . . . . . . . . . . . . . . . . . . . . . 48
5.3.2. RESULT . . . . . . . . . . . . . . . . . . . . . . . . 49
5.3.3. DESCRIPTION . . . . . . . . . . . . . . . . . . . . . 49
5.4. pNFS Considerations . . . . . . . . . . . . . . . . . . . 50
5.5. An Example of Detecting Corruption . . . . . . . . . . . . 50
5.6. Example of READ_PLUS . . . . . . . . . . . . . . . . . . . 52
5.7. Zero Filled Holes . . . . . . . . . . . . . . . . . . . . 52
6. Space Reservation . . . . . . . . . . . . . . . . . . . . . . 52
6.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . 52
6.2. Use Cases . . . . . . . . . . . . . . . . . . . . . . . . 54
6.2.1. Space Reservation . . . . . . . . . . . . . . . . . . 54
6.2.2. Space freed on deletes . . . . . . . . . . . . . . . . 54
6.2.3. Operations and attributes . . . . . . . . . . . . . . 55
6.2.4. Attribute 77: space_reserved . . . . . . . . . . . . . 55
6.2.5. Attribute 78: space_freed . . . . . . . . . . . . . . 56
6.2.6. Attribute 79: max_hole_punch . . . . . . . . . . . . . 56
6.2.7. Operation 64: HOLE_PUNCH - Zero and deallocate
       blocks backing the file in the specified range. . . . 56
7. Sparse Files . . . . . . . . . . . . . . . . . . . . . . . . . 57
7.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . 57
7.2. Terminology . . . . . . . . . . . . . . . . . . . . . . . 58
7.3. Applications and Sparse Files . . . . . . . . . . . . . . 59
7.4. Overview of Sparse Files and NFSv4 . . . . . . . . . . . . 60
7.5. Operation 65: READ_PLUS . . . . . . . . . . . . . . . . . 61
7.5.1. ARGUMENT . . . . . . . . . . . . . . . . . . . . . . . 61
7.5.2. RESULT . . . . . . . . . . . . . . . . . . . . . . . . 62
7.5.3. DESCRIPTION . . . . . . . . . . . . . . . . . . . . . 62
7.5.4. IMPLEMENTATION . . . . . . . . . . . . . . . . . . . . 64
7.5.5. READ_PLUS with Sparse Files Example . . . . . . . . . 65
7.6. Related Work . . . . . . . . . . . . . . . . . . . . . . . 66
7.7. Other Proposed Designs . . . . . . . . . . . . . . . . . . 66
7.7.1. Multi-Data Server Hole Information . . . . . . . . . . 66
7.7.2. Data Result Array . . . . . . . . . . . . . . . . . . 67
7.7.3. User-Defined Sparse Mask . . . . . . . . . . . . . . . 67
7.7.4. Allocated flag . . . . . . . . . . . . . . . . . . . . 67
7.7.5. Dense and Sparse pNFS File Layouts . . . . . . . . . . 68
8. Security Considerations . . . . . . . . . . . . . . . . . . . 68
9. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 68
10. References . . . . . . . . . . . . . . . . . . . . . . . . . . 68
10.1. Normative References . . . . . . . . . . . . . . . . . . . 68
10.2. Informative References . . . . . . . . . . . . . . . . . . 69
Appendix A. Acknowledgments . . . . . . . . . . . . . . . . . . . 70
Appendix B. RFC Editor Notes . . . . . . . . . . . . . . . . . . 71
Author's Address . . . . . . . . . . . . . . . . . . . . . . . . . 71
1. Introduction

1.1. The NFS Version 4 Minor Version 2 Protocol

The NFS version 4 minor version 2 (NFSv4.2) protocol is the third
minor version of the NFS version 4 (NFSv4) protocol.  The first minor
version, NFSv4.0, is described in [10] and the second minor version,
NFSv4.1, is described in [2].  It follows the guidelines for minor
versioning that are listed in Section 11 of RFC 3530bis.
skipping to change at page 43, line 36
The source server will therefore know that these NFSv4.1 operations
are being issued by the destination server identified in the
COPY_NOTIFY.

4.4.1.4. Inter-Server Copy without ONC RPC and RPCSEC_GSSv3

The same techniques as Section 4.4.1.3, using unique URLs for each
destination server, can be used for other protocols (e.g. HTTP [14]
and FTP [15]) as well.
5. Application Data Block Support

At the OS level, files are stored in disk blocks.  Applications are
also free to impose structure on the data contained in a file, and we
define an Application Data Block (ADB) to be such a structure.  From
the application's viewpoint, it only wants to handle ADBs and not raw
bytes (see [17]).  An ADB typically comprises two sections: a header
and data.  The header describes the characteristics of the block and
can provide a means to detect corruption in the data payload.  The
data section is typically initialized to all zeros.

The format of the header is application specific, but there are two
main components typically encountered:

1. An ADB Number (ADBN), which allows the application to determine
which data block is being referenced.  The ADBN is a logical
block number and is useful when the client is not storing the
blocks in contiguous memory.

2. Fields to describe the state of the ADB and a means to detect
block corruption.  For both pieces of data, a useful property is
that the allowed values be unique, such that if a value is passed
across the network, corruption due to translation between big-endian
and little-endian architectures is detectable.  For example,
0xF0DEDEF0 has the same byte representation in both architectures.
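The byte-order property claimed above is easy to verify.  The following is an illustrative Python sketch (not part of the protocol) showing that the example value 0xF0DEDEF0 encodes identically in both byte orders, while an arbitrary non-palindromic value does not:

```python
import struct

GUARD = 0xF0DEDEF0  # byte-palindromic guard value from the text

# Packing the value in big-endian and little-endian order yields the
# identical byte sequence, so its encoded form does not depend on the
# architecture that wrote it.
big = struct.pack('>I', GUARD)
little = struct.pack('<I', GUARD)
print(big == little)  # identical under both byte orders

# A non-palindromic value lacks this property:
print(struct.pack('>I', 0xDEADBEEF) == struct.pack('<I', 0xDEADBEEF))
```

A guard value chosen this way stays recognizable even if a buggy peer byte-swaps the block.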
Applications already impose structures on files [17] and detect
corruption in data blocks [18].  What they are not able to do is
efficiently transfer and store ADBs.  To initialize a file with ADBs,
the client must send each full ADB to the server, and the server must
store it.  When the application is initializing a file to have the
ADB structure, it could instead compress the ADBs down to just the
information necessary to later reconstruct the header portion of the
ADB when the contents are read back.  Using sparse file techniques,
the disk blocks described by the ADBs would not be allocated.  Unlike
sparse file techniques, there would be a small cost to store the
compressed header data.
In this section, we are going to define a generic framework for an
ADB, present one approach to detecting corruption in a given ADB
implementation, and describe the model for how the client and server
can support efficient initialization of ADBs, reading of ADB holes,
punching holes in ADBs, and space reservation. Further, we need to
be able to extend this model to applications which do not support
ADBs, but wish to be able to handle sparse files, hole punching, and
space reservation.
5.1. Generic Framework
We want the representation of the ADB to be flexible enough to
support many different applications.  The most basic approach is to
impose no block structure at all, which means we are working with raw
bytes.  Such an approach would be useful for storing holes, punching
holes, etc.  In more complex deployments, a server might be
supporting multiple applications, each with its own definition of the
ADB.  One application might store the ADBN at the start of the block
and then have a guard pattern to detect corruption [19].  Another
might store the ADBN at an offset of 100 bytes within the block and
have no guard pattern at all.  The point is that existing
applications might already have well-defined formats for their data
blocks.
The guard pattern can be used to represent the state of the block, to
protect against corruption, or both. Again, it needs to be able to
be placed anywhere within the ADB.
We need to be able to represent the starting offset of the block and
the size of the block. Note that nothing prevents the application
from defining different sized blocks in a file.
5.1.1. Data Block Representation
struct app_data_block4 {
offset4 adb_offset;
length4 adb_block_size;
length4 adb_block_count;
length4 adb_reloff_blocknum;
count4 adb_block_num;
length4 adb_reloff_pattern;
opaque adb_pattern<>;
};
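As a non-normative illustration of the field semantics, the sketch below models app_data_block4 in Python and materializes one described block into raw bytes.  The in-memory layout and helper names are assumptions for illustration; this is not the XDR wire encoding:

```python
from dataclasses import dataclass

NFS4_UINT64_MAX = 2**64 - 1

@dataclass
class AppDataBlock:
    # Mirrors the fields of app_data_block4 (illustrative model only).
    adb_offset: int
    adb_block_size: int
    adb_block_count: int
    adb_reloff_blocknum: int   # NFS4_UINT64_MAX => no block number stored
    adb_block_num: int         # ADBN of the first block in the sequence
    adb_reloff_pattern: int    # NFS4_UINT64_MAX => no guard pattern stored
    adb_pattern: bytes

def materialize(adb: AppDataBlock, n: int) -> bytes:
    """Expand the n-th block described by adb into raw bytes:
    zero-filled, with the ADBN and guard pattern placed at their
    relative offsets when those offsets are present."""
    block = bytearray(adb.adb_block_size)
    if adb.adb_reloff_blocknum != NFS4_UINT64_MAX:
        adbn = adb.adb_block_num + n  # each block gets a consecutive ADBN
        off = adb.adb_reloff_blocknum
        block[off:off + 8] = adbn.to_bytes(8, 'big')
    if adb.adb_reloff_pattern != NFS4_UINT64_MAX:
        off = adb.adb_reloff_pattern
        block[off:off + len(adb.adb_pattern)] = adb.adb_pattern
    return bytes(block)
```

For example, the 100-block initialization used later in Section 5.6 would be modeled as AppDataBlock(0, 4096, 100, 0, 0, 8, bytes.fromhex('feedface')).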
The app_data_block4 structure captures the abstraction presented for
the ADB.  The additional fields allow the transmission of
adb_block_count ADBs at one time.  We also use adb_block_num to
convey the ADBN of the first block in the sequence.  Each ADB will
contain the same adb_pattern string.

As both adb_block_num and adb_pattern are optional, if either
adb_reloff_pattern or adb_reloff_blocknum is set to NFS4_UINT64_MAX,
then the corresponding field is not set in any of the ADBs.
5.1.2. Data Content
/*
* Use an enum such that we can extend new types.
*/
enum data_content4 {
NFS4_CONTENT_DATA = 0,
NFS4_CONTENT_APP_BLOCK = 1,
NFS4_CONTENT_HOLE = 2
};
New operations might need to differentiate between wanting to access
data versus an ADB. Also, future minor versions might want to
introduce new data formats. This enumeration allows that to occur.
5.2. Operation 64: INITIALIZE
The server has no concept of the structure imposed by the
application.  It is only when the application writes to a section of
the file that structure gets imposed.  In order to detect corruption
even before the application utilizes the file, the application will
want to initialize a range of ADBs.  It uses the INITIALIZE operation
to do so.
5.2.1. ARGUMENT
/*
* We use data_content4 in case we wish to
* extend new types later. Note that we
* are explicitly disallowing data.
*/
union initialize_arg4 switch (data_content4 content) {
case NFS4_CONTENT_APP_BLOCK:
app_data_block4 ia_adb;
case NFS4_CONTENT_HOLE:
length4 ia_hole_length;
default:
void;
};
struct INITIALIZE4args {
/* CURRENT_FH: file */
stateid4 ia_stateid;
stable_how4 ia_stable;
offset4 ia_offset;
initialize_arg4 ia_data<>;
};
5.2.2. RESULT
struct INITIALIZE4resok {
count4 ir_count;
stable_how4 ir_committed;
verifier4 ir_writeverf;
data_content4 ir_sparse;
};
union INITIALIZE4res switch (nfsstat4 status) {
case NFS4_OK:
INITIALIZE4resok resok4;
default:
void;
};
5.2.3. DESCRIPTION
When the client invokes the INITIALIZE operation, it has two desired
results:
1. The structure described by the app_data_block4 be imposed on the
file.
2. The contents described by the app_data_block4 be sparse.
If the server supports the INITIALIZE operation, it still might not
support sparse files. So if it receives the INITIALIZE operation,
then it MUST populate the contents of the file with the initialized
ADBs. In other words, if the server supports INITIALIZE, then it
supports the concept of ADBs. [[Comment.1: Do we want to support an
asynchronous INITIALIZE? Do we have to? --TH]]
If the data was already initialized, there are two interesting
scenarios:
1. The data blocks are allocated.
2. Initializing in the middle of an existing ADB.
If the data blocks were already allocated, then the INITIALIZE is a
hole punch operation.  If the server supports sparse files, then the
data blocks are to be deallocated.  If not, then the data blocks are
to be rewritten in the indicated ADB format.  [[Comment.2: Need to
document interaction between space reservation and hole punching?
--TH]]
Since the server has no knowledge of ADBs, it should not report
misaligned creation of ADBs.  Even though it can detect them, it
cannot disallow them, as the application might be in the process of
changing the size of its ADBs.  Thus the server must be prepared to
handle an INITIALIZE into an existing ADB.
This document does not mandate the manner in which the server stores
ADBs sparsely for a file.  It does assume that if ADBs are stored
sparsely, then the server can detect when an INITIALIZE arrives that
will force a new ADB to start inside an existing ADB.  For example,
assume that ADBi has an adb_block_size of 4k and that an INITIALIZE
starts 1k inside ADBi.  The server should [[Comment.3: Need to flesh
this out.  --TH]]
5.3. Operation 65: READ_PLUS
If the client sends a READ operation, it is explicitly stating that
it is not supporting sparse files. So if a READ occurs on a sparse
ADB, then the server must expand such ADBs to be raw bytes. If a
READ occurs in the middle of an ADB, the server can only send back
bytes starting from that offset.
Such an operation is inefficient for transfer of sparse sections of
the file. As such, READ is marked as OBSOLETE in NFSv4.2. Instead,
a client should issue READ_PLUS. Note that as the client has no a
priori knowledge of whether an ADB is present or not, it should
always use READ_PLUS.
5.3.1. ARGUMENT
struct READ_PLUS4args {
/* CURRENT_FH: file */
stateid4 rpa_stateid;
offset4 rpa_offset;
count4 rpa_count;
};
5.3.2. RESULT
union read_plus_content switch (data_content4 content) {
case NFS4_CONTENT_DATA:
opaque rpc_data<>;
case NFS4_CONTENT_APP_BLOCK:
app_data_block4 rpc_block;
case NFS4_CONTENT_HOLE:
length4 rpc_hole_length;
default:
void;
};
/*
* Allow a return of an array of contents.
*/
struct read_plus_res4 {
bool rpr_eof;
read_plus_content rpr_contents<>;
};
union READ_PLUS4res switch (nfsstat4 status) {
case NFS4_OK:
read_plus_res4 resok4;
default:
void;
};
5.3.3. DESCRIPTION
Over the given range, READ_PLUS will return all data and ADBs found
as an array of read_plus_content.  It is possible to have consecutive
ADBs in the array, either because different definitions of ADBs are
present or because the guard pattern changes.
Edge cases exist for ADBs which either begin before the rpa_offset
requested by the READ_PLUS or end after the rpa_count requested -
both of which may occur because not all applications which access the
file are aware of the main application imposing a format on the file
contents, e.g., tar, dd, cp, etc.  READ_PLUS MUST retrieve whole
ADBs, but it need not retrieve an entire sequence of ADBs.
The server MUST return whole ADBs; where it cannot, it must expand
the partial ADB into raw data before sending it to the client.  E.g.,
if an ADB had a block size of 64k and the READ_PLUS was for 128k
starting at an offset of 32k inside the ADB, then the first 32k would
be converted to data.
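The expansion rule above can be sketched as a small helper.  This is an illustrative Python sketch under the simplifying assumption of fixed-size ADBs starting at a known offset; the function name and return shape are invented for the example:

```python
def split_read(offset: int, count: int, adb_start: int, adb_size: int):
    """Split a READ_PLUS range over a run of fixed-size ADBs into
    (leading raw bytes, whole ADBs, trailing raw bytes).  A partially
    covered ADB must be expanded to raw data; only fully covered ADBs
    can come back in compact ADB form."""
    into_block = (offset - adb_start) % adb_size
    lead = (adb_size - into_block) % adb_size   # bytes up to next boundary
    lead = min(lead, count)
    whole = (count - lead) // adb_size
    trail = (count - lead) % adb_size           # partial ADB at the end
    return lead, whole, trail

# The example from the text: 64k ADBs and a 128k READ_PLUS starting 32k
# inside an ADB.  The first 32k is converted to data, one ADB is whole,
# and the final 32k is again a partial ADB returned as data.
print(split_read(32 * 1024, 128 * 1024, 0, 64 * 1024))
```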
5.4. pNFS Considerations
While this document does not mandate how sparse ADBs are recorded on
the server, it does make the assumption that such information is not
in the file. I.e., the information is metadata. As such, the
INITIALIZE operation is defined to be not supported by the DS - it
must be issued to the MDS. But since the client must not assume a
priori whether a read is sparse or not, the READ_PLUS operation MUST
be supported by both the DS and the MDS. I.e., the client might
impose on the MDS to asynchronously read the data from the DS.
Furthermore, each DS MUST NOT report to a client either a sparse ADB
or data which belongs to another DS.  One implication of this
requirement is that the app_data_block4's adb_block_size MUST either
be the stripe width or the stripe width MUST be an even multiple of
it.
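The layout constraint above reduces to a divisibility check.  A minimal sketch (the function name is illustrative):

```python
def adb_layout_ok(adb_block_size: int, stripe_width: int) -> bool:
    # No ADB may straddle two data servers: the block size must equal
    # the stripe width, or the stripe width must be a whole multiple
    # of the block size.
    return stripe_width % adb_block_size == 0

print(adb_layout_ok(64 * 1024, 64 * 1024))   # equal sizes: allowed
print(adb_layout_ok(4 * 1024, 64 * 1024))    # 16 ADBs per stripe: allowed
print(adb_layout_ok(64 * 1024, 96 * 1024))   # ADBs straddle stripes: not allowed
```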
The second implication here is that the DS must be able to use the
Control Protocol to determine from the MDS where the sparse ADBs
occur. [[Comment.4: Need to discuss what happens if after the file
is being written to and an INITIALIZE occurs? --TH]] Perhaps instead
of the DS pulling from the MDS, the MDS pushes to the DS? Thus an
INITIALIZE causes a new push? [[Comment.5: Still need to consider
race cases of the DS getting a WRITE and the MDS getting an
INITIALIZE. --TH]]
5.5. An Example of Detecting Corruption
In this section, we define an ADB format in which corruption can be
detected. Note that this is just one possible format and means to
detect corruption.
Consider a very basic implementation of an operating system's disk
blocks. A block is either data or it is an indirect block which
allows for files to be larger than one block. It is desired to be
able to initialize a block. Lastly, to quickly unlink a file, a
block can be marked invalid. The contents remain intact - which
would enable this OS application to undelete a file.
The application defines 4k sized data blocks, with an 8 byte block
counter occurring at offset 0 in the block, and with the guard
pattern occurring at offset 8 inside the block. Furthermore, the
guard pattern can take one of four states:
0xfeedface - This is the FREE state and indicates that the ADB
format has been applied.
0xcafedead - This is the DATA state and indicates that real data
has been written to this block.
0xe4e5c001 - This is the INDIRECT state and indicates that the
block contains block counter numbers that are chained off of this
block.
0xba1ed4a3 - This is the INVALID state and indicates that the block
contains data whose contents are garbage.
Finally, it also defines an 8 byte checksum [20] starting at byte 16
which applies to the remaining contents of the block. If the state
is FREE, then that checksum is trivially zero. As such, the
application has no need to transfer the checksum implicitly inside
the ADB - it need not make the transfer layer aware of the fact that
there is a checksum (see [18] for an example of checksums used to
detect corruption in application data blocks).
Corruption in each ADB can be detected thusly:
o If the guard pattern is anything other than one of the allowed
values, including all zeros.
o If the guard pattern is FREE and any other byte in the remainder
of the ADB is anything other than zero.
o If the guard pattern is anything other than FREE, then if the
stored checksum does not match the computed checksum.
o If the guard pattern is INDIRECT and one of the stored indirect
block numbers has a value greater than the number of ADBs in the
file.
o If the guard pattern is INDIRECT and one of the stored indirect
block numbers is a duplicate of another stored indirect block
number.
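The checks in the list above can be collected into a single validation routine.  The following Python sketch models the hypothetical 4k-block format of this section; the draft leaves the checksum algorithm to the application, so CRC-32 is used here purely as a stand-in, and treating a zero slot in an INDIRECT block as "unused" is an added assumption:

```python
import zlib

FREE, DATA = 0xfeedface, 0xcafedead
INDIRECT, INVALID = 0xe4e5c001, 0xba1ed4a3
STATES = {FREE, DATA, INDIRECT, INVALID}
BLOCK_SIZE = 4096

def checksum(payload: bytes) -> int:
    # Stand-in for the application-defined 8-byte checksum.
    return zlib.crc32(payload)

def is_corrupt(block: bytes, total_adbs: int) -> bool:
    guard = int.from_bytes(block[8:12], 'big')
    if guard not in STATES:               # unknown guard (including zeros)
        return True
    if guard == FREE:                     # FREE => remainder must be zero
        return any(block[12:])
    stored = int.from_bytes(block[16:24], 'big')
    if stored != checksum(block[24:]):    # checksum mismatch
        return True
    if guard == INDIRECT:                 # validate chained block numbers
        nums = [int.from_bytes(block[i:i + 8], 'big')
                for i in range(24, BLOCK_SIZE, 8)]
        nums = [n for n in nums if n]     # assumption: 0 marks an unused slot
        if any(n > total_adbs for n in nums):   # out-of-range ADBN
            return True
        if len(nums) != len(set(nums)):         # duplicate ADBN
            return True
    return False
```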
As can be seen, the application can detect errors based on the
combination of the guard pattern state and the checksum.  But the
application can also detect corruption based on the state and the
contents of the ADB.  This last point is important in validating that
the minimum amount of data we incorporated into our generic framework
is sufficient.  I.e., the guard pattern alone is sufficient to allow
applications to design their own corruption detection.
Finally, it is important to note that none of these corruption checks
occur in the transport layer. The server and client components are
totally unaware of the file format and might report everything as
being transferred correctly even in the case the application detects
corruption.
5.6. Example of READ_PLUS
The hypothetical application presented in Section 5.5 can be used to
illustrate how READ_PLUS would return an array of results. A file is
created and initialized with 100 4k ADBs in the FREE state:
INITIALIZE {0, 4k, 100, 0, 0, 8, 0xfeedface}
Further, assume the application writes a single ADB at 16k, changing
the guard pattern to 0xcafedead.  We would then have in memory:
0 -> (16k - 1) : 4k, 4, 0, 0, 8, 0xfeedface
16k -> (20k - 1) : 00 00 00 05 ca fe de ad XX XX ... XX XX
20k -> 400k : 4k, 95, 0, 6, 0xfeedface
And when the client did a READ_PLUS of 64k at the start of the file,
it would get back a result of an ADB, some data, and a final ADB:
ADB {0, 4, 0, 0, 8, 0xfeedface}
data 4k
ADB {20k, 4k, 59, 0, 6, 0xfeedface}
5.7. Zero Filled Holes
As applications are free to define the structure of an ADB, it is
trivial to define an ADB which supports zero filled holes. Such a
case would encompass the traditional definitions of a sparse file and
hole punching. For example, to punch a 64k hole, starting at 100M,
into an existing file which has no ADB structure:
INITIALIZE {100M, 64k, 1, NFS4_UINT64_MAX,
0, NFS4_UINT64_MAX, 0x0}
6. Space Reservation
6.1. Introduction
This section describes a set of operations that allow applications
such as hypervisors to reserve space for a file, report the amount of
actual disk space a file occupies, and free up the backing space of a
file when it is not required.

In virtualized environments, virtual disk files are often stored on
NFS mounted volumes.  Since virtual disk files represent the hard
disks of virtual machines, hypervisors often have to guarantee
certain properties for the file.
skipping to change at page 53, line 41
Since virtual disks represent a hard drive in a virtual machine, a
virtual disk can be viewed as a filesystem within a file.  Since not
all blocks within a filesystem are in use, there is an opportunity to
reclaim blocks that are no longer in use.  A call to deallocate
blocks could result in better space efficiency.  Less space MAY be
consumed for backups after block deallocation.
We propose the following operations and attributes for the
aforementioned use cases:

space_reserved  This attribute specifies whether the blocks backing
the file have been preallocated.

space_freed  This attribute specifies the space freed when a file is
deleted, taking block sharing into consideration.

max_hole_punch  This attribute specifies the maximum sized hole that
can be punched on the filesystem.

HOLE_PUNCH  This operation zeroes and/or deallocates the blocks
backing a region of the file.
6.2. Use Cases

6.2.1. Space Reservation
Some applications require that once a file of a certain size is
created, writes to that file never fail with an out of space
condition.  One such example is that of a hypervisor writing to a
virtual disk.  An out of space condition while writing to virtual
disks would mean that the virtual machine would need to be frozen.

Currently, in order to achieve such a guarantee, applications zero
the entire file.  The initial zeroing allocates the backing blocks
and all subsequent writes are overwrites of already allocated blocks.
This approach is not only inefficient in terms of the amount of I/O
done, it is also not guaranteed to work on filesystems that are log
structured or deduplicated.  An efficient way of guaranteeing space
reservation would be beneficial to such applications.

If the space_reserved attribute is set on a file, it is guaranteed
that writes that do not grow the file will not fail with
NFSERR_NOSPC.
6.2.2.  Space freed on deletes

   Currently, files in NFS have two size attributes:

   size  The logical file size of the file.

   space_used  The size in bytes that the file occupies on disk.

   While these attributes are sufficient for space accounting in
   traditional filesystems, they prove to be inadequate in modern
   filesystems that support block sharing.  In such filesystems,
skipping to change at page 55, line 26
   to the given file that would be freed on its deletion.  In the
   example, both A and B would report space_freed as 4 * BLOCK_SIZE and
   space_used as 10 * BLOCK_SIZE.  If A is deleted, B will report
   space_freed as 10 * BLOCK_SIZE as the deletion of B would result in
   the deallocation of all 10 blocks.
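   The accounting in this example can be modeled with a toy block map
   (a non-normative illustration; the block-ID representation and
   BLOCK_SIZE value are invented for the sketch): each file maps to a
   set of block IDs, space_used counts every block backing the file,
   and space_freed counts only blocks referenced by no other file.

```python
# Toy model of block-sharing space accounting.  Shared blocks appear
# in more than one file's block set.
BLOCK_SIZE = 4096

def space_used(files, name):
    return len(files[name]) * BLOCK_SIZE

def space_freed(files, name):
    # Blocks freed on deletion: those referenced by no other file.
    others = set()
    for other, blocks in files.items():
        if other != name:
            others.update(blocks)
    return sum(BLOCK_SIZE for b in files[name] if b not in others)

# A and B each occupy 10 blocks and share 6 of them, as in the
# example above.
files = {
    "A": set(range(0, 10)),   # blocks 0-9
    "B": set(range(4, 14)),   # blocks 4-13; blocks 4-9 are shared
}
```

   Deleting "A" from the map makes all 10 of B's blocks uniquely
   referenced, so B's space_freed rises to 10 * BLOCK_SIZE, matching
   the text.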
   The addition of this attribute does not solve the problem of space
   being over-reported.  However, over-reporting is better than under-
   reporting.
6.2.3.  Operations and attributes

   In the sections that follow, one operation and three attributes are
   defined that together provide the space management facilities
   outlined earlier in the document.  The operation is intended to be
   OPTIONAL and the attributes RECOMMENDED as defined in section 17 of
   [2].
6.2.4.  Attribute 77: space_reserved

   The space_reserved attribute is a read/write attribute of type
   boolean.  It is a per file attribute.  When the space_reserved
   attribute is set via SETATTR, the server must ensure that there is
   disk space to accommodate every byte in the file before it can
   return success.  If the server cannot guarantee this, it must return
   NFS4ERR_NOSPC.
   If the client tries to grow a file which has the space_reserved
   attribute set, the server must guarantee that there is disk space to
skipping to change at page 56, line 15
   The value of space_reserved can be obtained at any time through
   GETATTR.

   In order to avoid ambiguity, the space_reserved bit cannot be set
   along with the size bit in SETATTR.  Increasing the size of a file
   with space_reserved set will fail if space reservation cannot be
   guaranteed for the new size.  If the file size is decreased, space
   reservation is only guaranteed for the new size and the extra blocks
   backing the file can be released.
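   Server-side enforcement of these rules might be sketched as follows.
   This is a hypothetical illustration: the dictionary-based attribute
   mask, the free-space argument, and the use of NFS4ERR_INVAL for the
   conflicting-bits case are all assumptions, not requirements of the
   draft.

```python
# Hypothetical sketch of the SETATTR-side checks described above.
# "space_reserved" and "size" stand in for attribute bitmask bits.
class SetattrError(Exception):
    pass

def check_setattr(requested_attrs, file_size, free_bytes):
    if "space_reserved" in requested_attrs and "size" in requested_attrs:
        # The two bits cannot be set in the same SETATTR; the exact
        # error code is an assumption here.
        raise SetattrError("NFS4ERR_INVAL")
    if requested_attrs.get("space_reserved"):
        if free_bytes < file_size:
            # Cannot back every byte of the file with disk space.
            raise SetattrError("NFS4ERR_NOSPC")
    return "NFS4_OK"
```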
6.2.5.  Attribute 78: space_freed

   space_freed gives the number of bytes freed if the file is deleted.
   This attribute is read only and is of type length4.  It is a per
   file attribute.

6.2.6.  Attribute 79: max_hole_punch

   max_hole_punch specifies the maximum size of a hole that the
   HOLE_PUNCH operation can handle.  This attribute is read only and of
   type length4.  It is a per filesystem attribute.  This attribute
   MUST be implemented if HOLE_PUNCH is implemented.
6.2.7.  Operation 64: HOLE_PUNCH - Zero and deallocate blocks backing
        the file in the specified range.

   WARNING: Most of this section is now obsolete.  Parts of it need to
   be scavenged for the ADB discussion, but for the most part, it
   cannot be trusted.
6.2.7.1.  DESCRIPTION

   Whenever a client wishes to deallocate the blocks backing a
   particular region in the file, it calls the HOLE_PUNCH operation
   with the current filehandle set to the filehandle of the file in
   question, start offset and length in bytes of the region set in
   hpa_offset and hpa_count respectively.  All further reads to this
   region MUST return zeros until overwritten.  The filehandle
   specified must be that of a regular file.

   Situations may arise where hpa_offset and/or hpa_offset + hpa_count
skipping to change at page 57, line 37
   NFS4ERR_NOTSUPP  The Hole punch operations are not supported by the
      NFS server receiving this request.

   NFS4ERR_DIR  The current filehandle is of type NF4DIR.

   NFS4ERR_SYMLINK  The current filehandle is of type NF4LNK.

   NFS4ERR_WRONG_TYPE  The current filehandle does not designate an
      ordinary file.
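   The required read-back behavior of HOLE_PUNCH, that all further
   reads of the punched region MUST return zeros until overwritten,
   can be illustrated with a toy in-memory file (a non-normative
   sketch; it models the visible semantics only, not the deallocation
   of backing blocks):

```python
# Toy model of HOLE_PUNCH read-back semantics: after punching
# [offset, offset + count), reads of that region return zeros until
# the region is overwritten.
class ToyFile:
    def __init__(self, data):
        self.data = bytearray(data)

    def hole_punch(self, offset, count):
        self.data[offset:offset + count] = b"\0" * count

    def read(self, offset, count):
        return bytes(self.data[offset:offset + count])
```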
7.  Sparse Files

   WARNING: Most of this section needs to be reworked because of the
   work going on in the ADB section.

7.1.  Introduction
   A sparse file is a common way of representing a large file without
   having to utilize all of the disk space for it.  Consequently, a
   sparse file uses less physical space than its size indicates.  This
   means the file contains 'holes', byte ranges within the file that
   contain no data.  Most modern file systems support sparse files,
   including most UNIX file systems and NTFS, but notably not Apple's
   HFS+.  Common examples of sparse files include Virtual Machine (VM)
   OS/disk images, database files, log files, and even checkpoint
   recovery files most commonly used by the HPC community.
skipping to change at page 58, line 38
   Besides reading sparse files and initializing them, applications
   might want to hole punch, which is the deallocation of the data
   blocks which back a region of the file.  At such time, the affected
   blocks are reinitialized to a pattern.
   This section introduces a new operation to read patterns from a
   file, READ_PLUS, and a new operation to both initialize patterns and
   to punch pattern holes into a file, WRITE_PLUS.  READ_PLUS supports
   all the features of READ but includes an extension to support sparse
   pattern files.  READ_PLUS is guaranteed to perform no worse than
   READ, and can dramatically improve performance with sparse files.
   READ_PLUS does not depend on pNFS protocol features, but can be used
   by pNFS to support sparse files.
7.2.  Terminology

   Regular file:  An object of file type NF4REG or NF4NAMEDATTR.

   Sparse file:  A Regular file that contains one or more Holes.

   Hole:  A byte range within a Sparse file that contains regions of
      all zeroes.  For block-based file systems, this could also be an
      unallocated region of the file.

   Hole Threshold:  The minimum length of a Hole as determined by the
      server.  If a server chooses to define a Hole Threshold, then it
      would not return hole information (nfs_readplusreshole) with a
      hole_offset and hole_length that specify a range shorter than the
      Hole Threshold.
7.3.  Applications and Sparse Files

   Applications may cause an NFS client to read holes in a file for
   several reasons.  This section describes three different application
   workloads that cause the NFS client to transfer data unnecessarily.
   These workloads are simply examples, and there are probably many
   more workloads that are negatively impacted by sparse files.

   The first workload that can cause holes to be read is sequential
   reads within a sparse file.  When this happens, the NFS client may
   perform read requests ("readahead") into sections of the file not
skipping to change at page 60, line 7
   supports sparse files and will not write all zero regions, whereas
   scp does not support sparse files and will transfer every byte of
   the file.

   The third workload is generated by applications that do not utilize
   the NFS client cache, but instead use direct I/O and manage cached
   data independently, e.g., databases.  These applications may perform
   whole file caching with sparse files, which would mean that even the
   holes will be transferred to the clients and cached.
7.4.  Overview of Sparse Files and NFSv4

   This proposal seeks to provide sparse file support to the largest
   number of NFS client and server implementations, and as such
   proposes to add a new return code to the mandatory NFSv4.1 READ_PLUS
   operation instead of proposing additions or extensions of new or
   existing optional features (such as pNFS).

   As well, this document seeks to ensure that the proposed extensions
   are simple and do not transfer data between the client and server
   unnecessarily.  For example, one possible way to implement sparse
skipping to change at page 61, line 7
   Another way to handle holes is compression, but this is not ideal
   since it requires all implementations to agree on a single
   compression algorithm and requires a fair amount of computational
   overhead.

   Note that supporting writing to a sparse file does not require
   changes to the protocol.  Applications and/or NFS implementations
   can choose to ignore WRITE requests of all zeroes to the NFS server
   without consequence.
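   The client-side choice described here, suppressing all-zero WRITEs
   when copying into a freshly created (hence all-zero) destination,
   might look as follows.  This is a non-normative sketch; the chunked
   copy loop and the send_write callback are invented stand-ins for an
   implementation's actual WRITE path.

```python
# Sketch: skip WRITE requests that consist entirely of zeroes,
# leaving a hole in the (initially empty) destination file instead.
def copy_chunks(chunks, send_write):
    offset = 0
    sent_offsets = []
    for chunk in chunks:
        if chunk.count(0) != len(chunk):   # chunk is not all zeroes
            send_write(offset, chunk)
            sent_offsets.append(offset)
        offset += len(chunk)
    return sent_offsets
```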
7.5.  Operation 65: READ_PLUS

   This section introduces a new read operation, named READ_PLUS, which
   allows NFS clients to avoid reading holes in a sparse file.
   READ_PLUS is guaranteed to perform no worse than READ, and can
   dramatically improve performance with sparse files.

   READ_PLUS supports all the features of the existing NFSv4.1 READ
   operation [2] and adds a simple yet significant extension to the
   format of its response.  The change allows the client to avoid
   returning all zeroes from a file hole, wasting computational and
skipping to change at page 61, line 33
   contain information about a file that may not even be read in its
   entirety.

   A new read operation is required due to NFSv4.1 minor versioning
   rules that do not allow modification of an existing operation's
   arguments or results.  READ_PLUS is designed in such a way to allow
   future extensions to the result structure.  The same approach could
   be taken to extend the argument structure, but a good use case is
   first required to make such a change.
7.5.1.  ARGUMENT

   struct READ_PLUS4args {
           /* CURRENT_FH: file */
           stateid4        rpa_stateid;
           offset4         rpa_offset;
           count4          rpa_count;
   };
7.5.2.  RESULT

   union read_plus_content switch (data_content4 content) {
   case NFS4_CONTENT_DATA:
           opaque          rpc_data<>;
   case NFS4_CONTENT_APP_BLOCK:
           app_data_block4 rpc_block;
   case NFS4_CONTENT_HOLE:
           length4         rpc_hole_length;
   default:
           void;
   };

   /*
    * Allow a return of an array of contents.
    */
   struct read_plus_res4 {
           bool                    rpr_eof;
           read_plus_content       rpr_contents<>;
   };

   union READ_PLUS4res switch (nfsstat4 status) {
   case NFS4_OK:
           read_plus_res4  resok4;
   default:
           void;
   };
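   The array-of-contents result above allows a single reply to mix
   data segments and hole segments.  A rough Python analogue of
   assembling such a reply is sketched below; the field names follow
   the XDR, but the (kind, value) input format and the server logic
   are invented for illustration.

```python
# Rough analogue of building a read_plus_res4-style reply from a
# (content-kind, value) description of the requested range.  Data
# segments carry bytes; hole segments carry only a length, mirroring
# the arms of the read_plus_content union.
def build_read_plus_res(segments, eof):
    contents = []
    for kind, value in segments:
        if kind == "NFS4_CONTENT_DATA":
            contents.append({"content": kind, "rpc_data": value})
        elif kind == "NFS4_CONTENT_HOLE":
            contents.append({"content": kind, "rpc_hole_length": value})
        else:
            raise ValueError("unknown content kind")
    return {"rpr_eof": eof, "rpr_contents": contents}
```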
7.5.3.  DESCRIPTION

   The READ_PLUS operation is based upon the NFSv4.1 READ operation
   [2], and similarly reads data from the regular file identified by
   the current filehandle.

   The client provides an offset of where the READ_PLUS is to start and
   a count of how many bytes are to be read.  An offset of zero means
   to read data starting at the beginning of the file.  If offset is
   greater than or equal to the size of the file, the status NFS4_OK is
   returned with nfs_readplusrestype4 set to READ_OK, data length set
   to
skipping to change at page 64, line 15
   For a READ_PLUS with a stateid value of all bits equal to zero, the
   server MAY allow the READ_PLUS to be serviced subject to mandatory
   byte-range locks or the current share deny modes for the file.  For
   a READ_PLUS with a stateid value of all bits equal to one, the
   server MAY allow READ_PLUS operations to bypass locking checks at
   the server.

   On success, the current filehandle retains its value.
7.5.4.  IMPLEMENTATION

   If the server returns a "short read" (i.e., fewer data than
   requested and eof is set to FALSE), the client should send another
   READ_PLUS to get the remaining data.  A server may return less data
   than requested under several circumstances.  The file may have been
   truncated by another client or perhaps on the server itself,
   changing the file size from what the requesting client believes to
   be the case.  This would reduce the actual amount of data available
   to the client.  It is possible that the server will reduce the
   transfer size and so return a short read result.  Server resource
   exhaustion may also occur in a
skipping to change at page 65, line 5
   being read, the delegation must be recalled, and the operation
   cannot proceed until that delegation is returned or revoked.  Except
   where this happens very quickly, one or more NFS4ERR_DELAY errors
   will be returned to requests made while the delegation remains
   outstanding.  Normally, delegations will not be recalled as a result
   of a READ_PLUS operation since the recall will occur as a result of
   an earlier OPEN.  However, since it is possible for a READ_PLUS to
   be done with a special stateid, the server needs to check for this
   case even though the client should have done an OPEN previously.
7.5.4.1.  Additional pNFS Implementation Information

   With pNFS, the semantics of using READ_PLUS remains the same.  Any
   data server MAY return a READ_HOLE result for a READ_PLUS request
   that it receives.

   When a data server chooses to return a READ_HOLE result, it has the
   option of returning hole information for the data stored on that
   data server (as defined by the data layout), but it MUST NOT return
   a nfs_readplusreshole structure with a byte range that includes data
   managed by another data server.
skipping to change at page 65, line 34
   A data server should do its best to return as much information about
   a hole as is feasible without having to contact the metadata server.
   If communication with the metadata server is required, then every
   attempt should be taken to minimize the number of requests.

   If mandatory locking is enforced, then the data server must also
   ensure that it returns only information for a Hole that is within
   the owner's locked byte range.
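   Restricting reported hole information to the owner's locked byte
   range is a simple interval intersection.  A non-normative sketch
   (half-open byte ranges are an implementation choice made here for
   clarity):

```python
# Intersect the hole [hole_off, hole_off + hole_len) with the lock
# [lock_off, lock_off + lock_len); report only the overlap, or
# nothing if the hole lies entirely outside the locked range.
def clamp_hole(hole_off, hole_len, lock_off, lock_len):
    start = max(hole_off, lock_off)
    end = min(hole_off + hole_len, lock_off + lock_len)
    if start >= end:
        return None            # no reportable hole inside the lock
    return (start, end - start)
```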
7.5.5.  READ_PLUS with Sparse Files Example

   To see how the return value READ_HOLE will work, the following table
   describes a sparse file.  For each byte range, the file contains
   either non-zero data or a hole.  In addition, the server in this
   example uses a hole threshold of 32K.

                         +-------------+----------+
                         | Byte-Range  | Contents |
                         +-------------+----------+
                         | 0-15999     | Hole     |
skipping to change at page 66, line 30
   3.  READ_PLUS(s, 256K, 64K) --> NFS_OK, readplusrestype4 = READ_OK,
       eof = false, data<>[32K].  Return a short read, as the last half
       of the request was all zeroes.

   4.  READ_PLUS(s, 288K, 64K) --> NFS_OK, readplusrestype4 =
       READ_HOLE, nfs_readplusreshole(HOLE_INFO)(288K, 66K).

   5.  READ_PLUS(s, 354K, 64K) --> NFS_OK, readplusrestype4 = READ_OK,
       eof = true, data<>[64K].
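   The role of the 32K hole threshold in this example can be sketched
   as a server-side decision rule: a zero range at or above the
   threshold is reported as a hole (as the 66K hole is in step 4),
   while a shorter zero range is served as ordinary data.  This is a
   non-normative illustration of the Hole Threshold definition in the
   Terminology section.

```python
# Sketch of a server-side hole-threshold decision.  The 32K value
# matches the example above.
HOLE_THRESHOLD = 32 * 1024

def classify_zero_range(length, threshold=HOLE_THRESHOLD):
    # Zero ranges shorter than the threshold are not worth reporting
    # as holes; they are returned as ordinary data.
    return "READ_HOLE" if length >= threshold else "READ_OK"
```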
7.6.  Related Work

   Solaris and ZFS support an extension to lseek(2) that allows
   applications to discover holes in a file.  The values, SEEK_HOLE and
   SEEK_DATA, allow clients to seek to the next hole or beginning of
   data, respectively.

   XFS supports the XFS_IOC_GETBMAP ioctl, which returns the Data
   Region Map for a file.  Clients can then use this information to
   avoid reading holes in a file.

   NTFS and CIFS support the FSCTL_SET_SPARSE attribute, which allows
   applications to control whether empty regions of the file are
   preallocated and filled in with zeros or simply left unallocated.
7.7.  Other Proposed Designs

7.7.1.  Multi-Data Server Hole Information

   The current design prohibits pNFS data servers from returning hole
   information for regions of a file that are not stored on that data
   server.  Having data servers return information regarding other data
   servers changes the fundamental principle that all metadata
   information comes from the metadata server.
   Here is a brief description of what would be required if we chose to
   support multi-data server hole information:

   For a data server that can obtain hole information for the entire
   file without severe performance impact, it MAY return HOLE_INFO and
   the byte range of the entire file hole.  When a pNFS client receives
   a READ_HOLE result and a non-empty nfs_readplusreshole structure, it
   MAY use this information in conjunction with a valid layout for the
   file to determine the next data server for the next region of data
   that is not in a hole.
7.7.2.  Data Result Array

   If a single read request contains one or more Holes with a length
   greater than the Sparse Threshold, the current design would return
   results indicating a short read to the client.  A client would then
   send a series of read requests to the server to retrieve information
   for the Holes and the remaining data.  To avoid turning a single
   read request into several exchanges between the client and server,
   the server may need to choose a relatively large Sparse Threshold in
   order to decrease the number of short reads it creates.  A large
   Sparse Threshold may miss many smaller holes, which in turn may
skipping to change at page 67, line 40
To avoid this situation, one option is to have the READ_PLUS To avoid this situation, one option is to have the READ_PLUS
operation return information for multiple holes in a single return operation return information for multiple holes in a single return
value. This would allow several small holes to be described in a value. This would allow several small holes to be described in a
single read response without requiring multliple exchanges between single read response without requiring multliple exchanges between
the client and server. the client and server.
One important item to consider with returning an array of data chunks
is its impact on RDMA, which may use different block sizes on the
client and server (among other things).
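Such an array-based result could be sketched in XDR roughly as
follows. This is purely illustrative: none of these type or field
names are part of the current proposal, and the final encoding would
need to account for the RDMA concerns noted above.

```xdr
/* Illustrative sketch only: a READ_PLUS result returning an array
 * of segments, each either file data or a hole, in file order.
 * All names here are assumptions, not normative. */
enum read_plus_content4 {
        READ_PLUS_DATA = 0,
        READ_PLUS_HOLE = 1
};

union read_plus_segment4 switch (read_plus_content4 rps_type) {
case READ_PLUS_DATA:
        opaque          rps_data<>;     /* file data for this segment */
case READ_PLUS_HOLE:
        struct {
                offset4 hole_offset;    /* start of the hole */
                length4 hole_length;    /* length of the hole */
        } rps_hole;
};

struct read_plus_res_ok4 {
        bool                    rpr_eof;
        read_plus_segment4      rpr_segments<>; /* data and holes */
};
```

With a result of this shape, several small holes and the data between
them can be described in one response, at the cost of a more complex
decode path on the client.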
7.7.3. User-Defined Sparse Mask
Add a user-defined mask (instead of just zeroes). Should it be
specified by the server or the client?
7.7.4. Allocated Flag
A Hole on the server may be an allocated byte-range consisting of all
zeroes or may not be allocated at all. To ensure this information is
properly communicated to the client, it may be beneficial to add an
'alloc' flag to the HOLE_INFO section of nfs_readplusreshole. This
would allow an NFS client to copy a file from one file system to
another and have it more closely resemble the original.
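A minimal sketch of what such a flag might look like in the hole
information returned by READ_PLUS is shown below; the 'hole_alloc'
field and the surrounding names are assumptions for illustration, not
agreed protocol elements.

```xdr
/* Illustrative sketch: hole information extended with an
 * allocation flag.  hole_alloc distinguishes an allocated
 * all-zero byte-range (TRUE) from an unallocated hole (FALSE),
 * letting a client reproduce the allocation state on copy. */
struct hole_info4 {
        offset4         hole_offset;    /* start of the hole */
        length4         hole_length;    /* length of the hole */
        bool            hole_alloc;     /* TRUE if allocated but
                                           zero-filled */
};
```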
7.7.5. Dense and Sparse pNFS File Layouts
The hole information returned from a data server must be understood
by pNFS clients using either the Dense or the Sparse file layout
type. Does the current READ_PLUS return value work for both layout
types? Does the data server know if it is using dense or sparse so
that it can return the correct hole_offset and hole_length values?
8. Security Considerations
9. IANA Considerations
This section uses terms that are defined in [21].
10. References
10.1. Normative References
[1] Bradner, S., "Key words for use in RFCs to Indicate Requirement
Levels", BCP 14, RFC 2119, March 1997.
[2] Shepler, S., Eisler, M., and D. Noveck, "Network File System
(NFS) Version 4 Minor Version 1 Protocol", RFC 5661, January 2010.
[3] Haynes, T., "Network File System (NFS) Version 4 Minor Version
2 External Data Representation Standard (XDR) Description",
[7] Shepler, S., Eisler, M., and D. Noveck, "Network File System
(NFS) Version 4 Minor Version 1 External Data Representation
Standard (XDR) Description", RFC 5662, January 2010.
[8] Black, D., Glasgow, J., and S. Fridella, "Parallel NFS (pNFS)
Block/Volume Layout", RFC 5663, January 2010.
[9] Eisler, M., Chiu, A., and L. Ling, "RPCSEC_GSS Protocol
Specification", RFC 2203, September 1997.
10.2. Informative References
[10] Haynes, T. and D. Noveck, "Network File System (NFS) version 4
Protocol", draft-ietf-nfsv4-rfc3530bis-09 (Work In Progress),
March 2011.
[11] Eisler, M., "XDR: External Data Representation Standard",
RFC 4506, May 2006.
[12] Lentini, J., Everhart, C., Ellard, D., Tewari, R., and M. Naik,
"NSDB Protocol for Federated Filesystems",
[14] Fielding, R., Gettys, J., Mogul, J., Frystyk, H., Masinter, L.,
Leach, P., and T. Berners-Lee, "Hypertext Transfer Protocol --
HTTP/1.1", RFC 2616, June 1999.
[15] Postel, J. and J. Reynolds, "File Transfer Protocol", STD 9,
RFC 959, October 1985.
[16] Simpson, W., "PPP Challenge Handshake Authentication Protocol
(CHAP)", RFC 1994, August 1996.
[17] Strohm, R., "Chapter 2, Data Blocks, Extents, and Segments, of
Oracle Database Concepts 11g Release 1 (11.1)", January 2011.
[18] Ashdown, L., "Chapter 15, Validating Database Files and
Backups, of Oracle Database Backup and Recovery User's Guide
11g Release 1 (11.1)", August 2008.
[19] McDougall, R. and J. Mauro, "Section 11.4.3, Detecting Memory
Corruption of Solaris Internals", 2007.
[20] Bairavasundaram, L., Goodson, G., Schroeder, B., Arpaci-Dusseau,
A., and R. Arpaci-Dusseau, "An Analysis of Data Corruption in the
Storage Stack", Proceedings of the 6th USENIX Symposium on File and
Storage Technologies (FAST '08), 2008.
[21] Narten, T. and H. Alvestrand, "Guidelines for Writing an IANA
Considerations Section in RFCs", BCP 26, RFC 5226, May 2008.
[22] Nowicki, B., "NFS: Network File System Protocol specification",
RFC 1094, March 1989.
[23] Callaghan, B., Pawlowski, B., and P. Staubach, "NFS Version 3
Protocol Specification", RFC 1813, June 1995.
[24] Srinivasan, R., "Binding Protocols for ONC RPC Version 2",
RFC 1833, August 1995.
[25] Eisler, M., "NFS Version 2 and Version 3 Security Issues and
the NFS Protocol's Use of RPCSEC_GSS and Kerberos V5",
RFC 2623, June 1999.
[26] Callaghan, B., "NFS URL Scheme", RFC 2224, October 1997.
[27] Shepler, S., "NFS Version 4 Design Considerations", RFC 2624,
June 1999.
[28] Reynolds, J., "Assigned Numbers: RFC 1700 is Replaced by an
On-line Database", RFC 3232, January 2002.
[29] Linn, J., "The Kerberos Version 5 GSS-API Mechanism", RFC 1964,
June 1996.
[30] Shepler, S., Callaghan, B., Robinson, D., Thurlow, R., Beame,
C., Eisler, M., and D. Noveck, "Network File System (NFS)
version 4 Protocol", RFC 3530, April 2003.
Appendix A. Acknowledgments
For the pNFS Access Permissions Check, the original draft was by
Sorin Faibish, David Black, Mike Eisler, and Jason Glasgow. The work
was influenced by discussions with Benny Halevy and Bruce Fields. A
review was done by Tom Haynes.

For the Sharing change attribute implementation details with NFSv4
clients, the original draft was by Trond Myklebust.