draft-ietf-nfsv4-flex-files-04.txt | draft-ietf-nfsv4-flex-files-05.txt | |||
---|---|---|---|---|
NFSv4 B. Halevy | NFSv4 B. Halevy | |||
Internet-Draft T. Haynes | Internet-Draft | |||
Intended status: Informational Primary Data | Intended status: Standards Track T. Haynes | |||
Expires: June 7, 2015 December 04, 2014 | Expires: August 13, 2015 Primary Data | |||
February 09, 2015 | ||||
Parallel NFS (pNFS) Flexible File Layout | Parallel NFS (pNFS) Flexible File Layout | |||
draft-ietf-nfsv4-flex-files-04.txt | draft-ietf-nfsv4-flex-files-05.txt | |||
Abstract | Abstract | |||
The Parallel Network File System (pNFS) allows a separation between | The Parallel Network File System (pNFS) allows a separation between | |||
the metadata and data for a file. The metadata file access is | the metadata (onto a metadata server) and data (onto a storage | |||
handled via Network File System version 4 (NFSv4) minor version 1 | device) for a file. The Flexible File Layout Type is defined in this | |||
(NFSv4.1) and the data file access is specific to the protocol being | document as an extension to pNFS to allow the use of storage devices | |||
used between the client and storage device. The client is informed | in a fashion such that they require only a quite limited degree of | |||
by the metadata server as to which protocol to use via a Layout Type. | interaction with the metadata server, using already existing | |||
The Flexible File Layout Type is defined in this document as an | protocols. Client side mirroring is also added to provide | |||
extension to NFSv4.1 to allow the use of storage devices which need | replication of files. | |||
not be tightly coupled to the metadata server. | ||||
Status of This Memo | Status of This Memo | |||
This Internet-Draft is submitted in full conformance with the | This Internet-Draft is submitted in full conformance with the | |||
provisions of BCP 78 and BCP 79. | provisions of BCP 78 and BCP 79. | |||
Internet-Drafts are working documents of the Internet Engineering | Internet-Drafts are working documents of the Internet Engineering | |||
Task Force (IETF). Note that other groups may also distribute | Task Force (IETF). Note that other groups may also distribute | |||
working documents as Internet-Drafts. The list of current Internet- | working documents as Internet-Drafts. The list of current Internet- | |||
Drafts is at http://datatracker.ietf.org/drafts/current/. | Drafts is at http://datatracker.ietf.org/drafts/current/. | |||
Internet-Drafts are draft documents valid for a maximum of six months | Internet-Drafts are draft documents valid for a maximum of six months | |||
and may be updated, replaced, or obsoleted by other documents at any | and may be updated, replaced, or obsoleted by other documents at any | |||
time. It is inappropriate to use Internet-Drafts as reference | time. It is inappropriate to use Internet-Drafts as reference | |||
material or to cite them other than as "work in progress." | material or to cite them other than as "work in progress." | |||
This Internet-Draft will expire on June 7, 2015. | This Internet-Draft will expire on August 13, 2015. | |||
Copyright Notice | Copyright Notice | |||
Copyright (c) 2014 IETF Trust and the persons identified as the | Copyright (c) 2015 IETF Trust and the persons identified as the | |||
document authors. All rights reserved. | document authors. All rights reserved. | |||
This document is subject to BCP 78 and the IETF Trust's Legal | This document is subject to BCP 78 and the IETF Trust's Legal | |||
Provisions Relating to IETF Documents | Provisions Relating to IETF Documents | |||
(http://trustee.ietf.org/license-info) in effect on the date of | (http://trustee.ietf.org/license-info) in effect on the date of | |||
publication of this document. Please review these documents | publication of this document. Please review these documents | |||
carefully, as they describe your rights and restrictions with respect | carefully, as they describe your rights and restrictions with respect | |||
to this document. Code Components extracted from this document must | to this document. Code Components extracted from this document must | |||
include Simplified BSD License text as described in Section 4.e of | include Simplified BSD License text as described in Section 4.e of | |||
the Trust Legal Provisions and are provided without warranty as | the Trust Legal Provisions and are provided without warranty as | |||
described in the Simplified BSD License. | described in the Simplified BSD License. | |||
Table of Contents | Table of Contents | |||
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 3 | 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 3 | |||
1.1. Definitions . . . . . . . . . . . . . . . . . . . . . . . 3 | 1.1. Definitions . . . . . . . . . . . . . . . . . . . . . . . 3 | |||
1.2. Difference Between a Data Server and a Storage Device . . 5 | 1.2. Difference Between a Data Server and a Storage Device . . 5 | |||
1.3. Requirements Language . . . . . . . . . . . . . . . . . . 5 | 1.3. Requirements Language . . . . . . . . . . . . . . . . . . 6 | |||
2. Coupling of Storage Devices . . . . . . . . . . . . . . . . . 6 | 2. Coupling of Storage Devices . . . . . . . . . . . . . . . . . 6 | |||
2.1. LAYOUTCOMMIT . . . . . . . . . . . . . . . . . . . . . . 6 | 2.1. LAYOUTCOMMIT . . . . . . . . . . . . . . . . . . . . . . 6 | |||
2.2. Security Models . . . . . . . . . . . . . . . . . . . . . 6 | 2.2. Security Models . . . . . . . . . . . . . . . . . . . . . 6 | |||
2.3. State and Locking Models . . . . . . . . . . . . . . . . 7 | 2.2.1. Implementation Notes for Synthetic uids/gids . . . . 7 | |||
3. XDR Description of the Flexible File Layout Type . . . . . . 7 | 2.2.2. Example of using Synthetic uids/gids . . . . . . . . 7 | |||
3.1. Code Components Licensing Notice . . . . . . . . . . . . 8 | 2.3. State and Locking Models . . . . . . . . . . . . . . . . 8 | |||
4. Device Addressing and Discovery . . . . . . . . . . . . . . . 9 | 3. XDR Description of the Flexible File Layout Type . . . . . . 9 | |||
4.1. ff_device_addr4 . . . . . . . . . . . . . . . . . . . . . 9 | 3.1. Code Components Licensing Notice . . . . . . . . . . . . 9 | |||
4.2. Storage Device Multipathing . . . . . . . . . . . . . . . 11 | 4. Device Addressing and Discovery . . . . . . . . . . . . . . . 11 | |||
5. Flexible File Layout Type . . . . . . . . . . . . . . . . . . 12 | 4.1. ff_device_addr4 . . . . . . . . . . . . . . . . . . . . . 11 | |||
5.1. ff_layout4 . . . . . . . . . . . . . . . . . . . . . . . 12 | 4.2. Storage Device Multipathing . . . . . . . . . . . . . . . 12 | |||
5.2. Interactions Between Devices and Layouts . . . . . . . . 15 | 5. Flexible File Layout Type . . . . . . . . . . . . . . . . . . 13 | |||
5.3. Handling Version Errors . . . . . . . . . . . . . . . . . 15 | 5.1. ff_layout4 . . . . . . . . . . . . . . . . . . . . . . . 14 | |||
6. Striping via Sparse Mapping . . . . . . . . . . . . . . . . . 16 | 5.2. Interactions Between Devices and Layouts . . . . . . . . 17 | |||
7. Recovering from Client I/O Errors . . . . . . . . . . . . . . 16 | 5.3. Handling Version Errors . . . . . . . . . . . . . . . . . 17 | |||
8. Mirroring . . . . . . . . . . . . . . . . . . . . . . . . . . 17 | 6. Striping via Sparse Mapping . . . . . . . . . . . . . . . . . 18 | |||
8.1. Selecting a Mirror . . . . . . . . . . . . . . . . . . . 17 | 7. Recovering from Client I/O Errors . . . . . . . . . . . . . . 18 | |||
8.2. Writing to Mirrors . . . . . . . . . . . . . . . . . . . 18 | 8. Mirroring . . . . . . . . . . . . . . . . . . . . . . . . . . 19 | |||
8.3. Metadata Server Resilvering of the File . . . . . . . . . 18 | 8.1. Selecting a Mirror . . . . . . . . . . . . . . . . . . . 20 | |||
9. Flexible Files Layout Type Return . . . . . . . . . . . . . . 19 | 8.2. Writing to Mirrors . . . . . . . . . . . . . . . . . . . 20 | |||
9.1. I/O Error Reporting . . . . . . . . . . . . . . . . . . . 19 | 8.3. Metadata Server Resilvering of the File . . . . . . . . . 21 | |||
9.1.1. ff_ioerr4 . . . . . . . . . . . . . . . . . . . . . . 19 | 9. Flexible Files Layout Type Return . . . . . . . . . . . . . . 21 | |||
9.2. Layout Usage Statistics . . . . . . . . . . . . . . . . . 20 | 9.1. I/O Error Reporting . . . . . . . . . . . . . . . . . . . 22 | |||
9.2.1. ff_io_latency4 . . . . . . . . . . . . . . . . . . . 20 | 9.1.1. ff_ioerr4 . . . . . . . . . . . . . . . . . . . . . . 22 | |||
9.2.2. ff_layoutupdate4 . . . . . . . . . . . . . . . . . . 21 | 9.2. Layout Usage Statistics . . . . . . . . . . . . . . . . . 23 | |||
9.2.3. ff_iostats4 . . . . . . . . . . . . . . . . . . . . . 22 | 9.2.1. ff_io_latency4 . . . . . . . . . . . . . . . . . . . 23 | |||
9.3. ff_layoutreturn4 . . . . . . . . . . . . . . . . . . . . 23 | 9.2.2. ff_layoutupdate4 . . . . . . . . . . . . . . . . . . 23 | |||
10. Flexible Files Layout Type LAYOUTERROR . . . . . . . . . . . 23 | 9.2.3. ff_iostats4 . . . . . . . . . . . . . . . . . . . . . 24 | |||
11. Flexible Files Layout Type LAYOUTSTATS . . . . . . . . . . . 23 | 9.3. ff_layoutreturn4 . . . . . . . . . . . . . . . . . . . . 25 | |||
12. Flexible File Layout Type Creation Hint . . . . . . . . . . . 24 | 10. Flexible Files Layout Type LAYOUTERROR . . . . . . . . . . . 25 | |||
12.1. ff_layouthint4 . . . . . . . . . . . . . . . . . . . . . 24 | 11. Flexible Files Layout Type LAYOUTSTATS . . . . . . . . . . . 25 | |||
13. Recalling Layouts . . . . . . . . . . . . . . . . . . . . . . 25 | 12. Flexible File Layout Type Creation Hint . . . . . . . . . . . 26 | |||
13.1. CB_RECALL_ANY . . . . . . . . . . . . . . . . . . . . . 25 | 12.1. ff_layouthint4 . . . . . . . . . . . . . . . . . . . . . 26 | |||
14. Client Fencing . . . . . . . . . . . . . . . . . . . . . . . 26 | 13. Recalling Layouts . . . . . . . . . . . . . . . . . . . . . . 27 | |||
15. Security Considerations . . . . . . . . . . . . . . . . . . . 26 | 13.1. CB_RECALL_ANY . . . . . . . . . . . . . . . . . . . . . 27 | |||
15.1. Kerberized File Access . . . . . . . . . . . . . . . . . 27 | 14. Client Fencing . . . . . . . . . . . . . . . . . . . . . . . 28 | |||
15.1.1. Loosely Coupled . . . . . . . . . . . . . . . . . . 27 | 15. Security Considerations . . . . . . . . . . . . . . . . . . . 28 | |||
15.1.2. Tightly Coupled . . . . . . . . . . . . . . . . . . 27 | 15.1. Kerberized File Access . . . . . . . . . . . . . . . . . 29 | |||
16. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 28 | 15.1.1. Loosely Coupled . . . . . . . . . . . . . . . . . . 29 | |||
17. References . . . . . . . . . . . . . . . . . . . . . . . . . 28 | 15.1.2. Tightly Coupled . . . . . . . . . . . . . . . . . . 29 | |||
17.1. Normative References . . . . . . . . . . . . . . . . . . 28 | 16. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 30 | |||
17.2. Informative References . . . . . . . . . . . . . . . . . 29 | 17. References . . . . . . . . . . . . . . . . . . . . . . . . . 30 | |||
Appendix A. Acknowledgments . . . . . . . . . . . . . . . . . . 29 | 17.1. Normative References . . . . . . . . . . . . . . . . . . 30 | |||
Appendix B. RFC Editor Notes . . . . . . . . . . . . . . . . . . 29 | 17.2. Informative References . . . . . . . . . . . . . . . . . 31 | |||
Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 29 | Appendix A. Acknowledgments . . . . . . . . . . . . . . . . . . 31 | |||
Appendix B. RFC Editor Notes . . . . . . . . . . . . . . . . . . 31 | ||||
Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 31 | ||||
1. Introduction | 1. Introduction | |||
In the parallel Network File System (pNFS), the metadata server | In the parallel Network File System (pNFS), the metadata server | |||
returns Layout Type structures that describe where file data is | returns Layout Type structures that describe where file data is | |||
located. There are different Layout Types for different storage | located. There are different Layout Types for different storage | |||
systems and methods of arranging data on storage devices. This | systems and methods of arranging data on storage devices. This | |||
document defines the Flexible File Layout Type used with file-based | document defines the Flexible File Layout Type used with file-based | |||
data servers that are accessed using the Network File System (NFS) | data servers that are accessed using the Network File System (NFS) | |||
protocols: NFSv3 [RFC1813], NFSv4 [RFC3530], NFSv4.1 [RFC5661], and | protocols: NFSv3 [RFC1813], NFSv4.0 [RFCNFSv4], NFSv4.1 [RFC5661], | |||
NFSv4.2 [NFSv42]. | and NFSv4.2 [NFSv42]. | |||
To provide a global state model equivalent to that of the Files | To provide a global state model equivalent to that of the Files | |||
Layout Type, a back-end control protocol MAY be implemented between | Layout Type, a back-end control protocol MAY be implemented between | |||
the metadata server and NFSv4.1 storage devices. It is out of scope | the metadata server and NFSv4.1+ storage devices. It is out of scope | |||
for this document to specify the wire protocol of such a protocol, | for this document to specify the wire protocol of such a protocol, | |||
yet the requirements for the protocol are specified in [RFC5661] and | yet the requirements for the protocol are specified in [RFC5661] and | |||
clarified in [pNFSLayouts]. | clarified in [pNFSLayouts]. | |||
1.1. Definitions | 1.1. Definitions | |||
control protocol: is a set of requirements for the communication of | control protocol: is a set of requirements for the communication of | |||
information on layouts, stateids, file metadata, and file data | information on layouts, stateids, file metadata, and file data | |||
between the metadata server and the storage devices (see | between the metadata server and the storage devices (see | |||
[pNFSLayouts]). | [pNFSLayouts]). | |||
client-side mirroring: is when the client and not the server is | client-side mirroring: is when the client and not the server is | |||
responsible for updating all of the mirrored copies of a file. | responsible for updating all of the mirrored copies of a layout | |||
segment. | ||||
data file: is that part of the file system object which describes | data file: is that part of the file system object which describes | |||
the payload and not the object. E.g., it is the file contents. | the payload and not the object. E.g., it is the file contents. | |||
data server (DS): is one of the pNFS servers which provide the | data server (DS): is one of the pNFS servers which provides the | |||
contents of a file system object which is a regular file. | contents of a file system object which is a regular file. | |||
Depending on the layout, there might be one or more data servers | Depending on the layout, there might be one or more data servers | |||
over which the data is striped. Note that while the metadata | over which the data is striped. Note that while the metadata | |||
server is strictly accessed over the NFSv4.1 protocol, depending | server is strictly accessed over the NFSv4.1+ protocol, depending | |||
on the Layout Type, the data server could be accessed via any | on the Layout Type, the data server could be accessed via any | |||
protocol that meets the pNFS requirements. | protocol that meets the pNFS requirements. | |||
fencing: is when the metadata server prevents the storage devices | fencing: is when the metadata server prevents the storage devices | |||
from processing I/O from a specific client to a specific file. | from processing I/O from a specific client to a specific file. | |||
File Layout Type: is a Layout Type in which the storage devices are | File Layout Type: is a Layout Type in which the storage devices are | |||
accessed via the NFSv4.1 protocol. It is defined in Section 13 of | accessed via the NFS protocol. | |||
[RFC5661]. | ||||
layout: informs a client of which storage devices it needs to | layout: informs a client of which storage devices it needs to | |||
communicate with (and over which protocol) to perform I/O on a | communicate with (and over which protocol) to perform I/O on a | |||
file. The layout might also provide some hints about how the | file. The layout might also provide some hints about how the | |||
storage is physically organized. | storage is physically organized. | |||
layout iomode: describes whether the layout granted to the client is | layout iomode: describes whether the layout granted to the client is | |||
for read or read/write I/O. | for read or read/write I/O. | |||
layout segment: describes a sub-division of a layout. That sub- | ||||
division might be by the iomode (see Sections 3.3.20 and 12.2.9 of | ||||
[RFC5661]), a striping pattern (see Section 13.3 of [RFC5661]), or | ||||
requested byte range. | ||||
layout stateid: is a 128-bit quantity returned by a server that | layout stateid: is a 128-bit quantity returned by a server that | |||
uniquely defines the layout state provided by the server for a | uniquely defines the layout state provided by the server for a | |||
specific layout that describes a Layout Type and file (see | specific layout that describes a Layout Type and file (see | |||
Section 12.5.2 of [RFC5661]). Further, Section 12.5.3 describes | Section 12.5.2 of [RFC5661]). Further, Section 12.5.3 describes | |||
the difference between a layout stateid and a normal stateid. | the difference between a layout stateid and a normal stateid. | |||
layout type: describes both the storage protocol used to access the | layout type: describes both the storage protocol used to access the | |||
data and the aggregation scheme used to lays out the file data on | data and the aggregation scheme used to lay out the file data on | |||
the underlying storage devices. | the underlying storage devices. | |||
loose coupling: is when the metadata server and the storage devices | loose coupling: is when the metadata server and the storage devices | |||
do not have a control protocol present. | do not have a control protocol present. | |||
metadata file: is that part of the file system object which | metadata file: is that part of the file system object which | |||
describes the object and not the payload. E.g., it could be the | describes the object and not the payload. E.g., it could be the | |||
time since last modification, access, etc. | time since last modification, access, etc. | |||
metadata server (MDS): is the pNFS server which provides metadata | metadata server (MDS): is the pNFS server which provides metadata | |||
information for a file system object. It also is responsible for | information for a file system object. It also is responsible for | |||
generating layouts for file system objects. Note that the MDS is | generating layouts for file system objects. Note that the MDS is | |||
responsible for directory-based operations. | responsible for directory-based operations. | |||
mirror: is a copy of a file. While mirroring can be used for | mirror: is a copy of a layout segment. While mirroring can be used | |||
backing up a file, the copies can be distributed such that each | for backing up a layout segment, the copies can be distributed | |||
remote site has a locally cached copy. Note that if one copy of | such that each remote site has a locally available copy. Note | |||
the mirror is updated, then all copies must be updated. | that if one copy of the mirror is updated, then all copies must be | |||
updated. | ||||
Object Layout Type: is a Layout Type in which the storage devices | ||||
are accessed via the OSD protocol [ANSI400-2004]. It is defined | ||||
in [RFC5664]. | ||||
recalling a layout: is when the metadata server uses a back channel | recalling a layout: is when the metadata server uses a back channel | |||
to inform the client that the layout is to be returned in a | to inform the client that the layout is to be returned in a | |||
graceful manner. Note that the client could be able to flush any | graceful manner. Note that the client could be able to flush any | |||
writes, etc., before replying to the metadata server. | writes, etc., before replying to the metadata server. | |||
revoking a layout: is when the metadata server invalidates the | revoking a layout: is when the metadata server invalidates the | |||
layout such that neither the metadata server nor any storage | layout such that neither the metadata server nor any storage | |||
device will accept any access from the client with that layout. | device will accept any access from the client with that layout. | |||
resilvering: is the act of rebuilding a mirrored copy of a file from | resilvering: is the act of rebuilding a mirrored copy of a layout | |||
a known good copy of the file. Note that this can also be done to | segment from a known good copy of the layout segment. Note that | |||
create a new mirrored copy of the file. | this can also be done to create a new mirrored copy of the layout | |||
segment. | ||||
rsize: is the data transfer buffer size used for reads. | rsize: is the data transfer buffer size used for reads. | |||
stateid: is a 128-bit quantity returned by a server that uniquely | stateid: is a 128-bit quantity returned by a server that uniquely | |||
defines the open and locking states provided by the server for a | defines the open and locking states provided by the server for a | |||
specific open-owner or lock-owner/open-owner pair for a specific | specific open-owner or lock-owner/open-owner pair for a specific | |||
file and type of lock. | file and type of lock. | |||
storage device: is another term used almost interchangeably with | storage device: is another term used almost interchangeably with | |||
data server. See Section 1.2 for the nuances between the two. | data server. See Section 1.2 for the nuances between the two. | |||
tight coupling: is when the metadata server and the storage devices | tight coupling: is when the metadata server and the storage devices | |||
do have a control protocol present. | do have a control protocol present. | |||
wsize: is the data transfer buffer size used for writes. | wsize: is the data transfer buffer size used for writes. | |||
1.2. Difference Between a Data Server and a Storage Device | 1.2. Difference Between a Data Server and a Storage Device | |||
We defined a data server as a pNFS server, which implies that it can | We defined a data server as a pNFS server, which implies that it can | |||
utilize the NFSv4.1 protocol to communicate with the client. As | utilize the NFSv4.1+ protocol to communicate with the client. As | |||
such, only the File Layout Type would currently meet this | such, only the File Layout Type would currently meet this | |||
requirement. The more generic concept is a storage device, which can | requirement. The more generic concept is a storage device, which can | |||
use any protocol to communicate with the client. The requirements | use any protocol to communicate with the client. The requirements | |||
for a storage device to act together with the metadata server to | for a storage device to act together with the metadata server to | |||
provide data to a client are that there is a Layout Type | provide data to a client are that there is a Layout Type | |||
specification for the given protocol and that the metadata server has | specification for the given protocol and that the metadata server has | |||
granted a layout to the client. Note that nothing precludes there | granted a layout to the client. Note that nothing precludes there | |||
being multiple supported Layout Types (i.e., protocols) between a | being multiple supported Layout Types (i.e., protocols) between a | |||
metadata server, storage devices, and client. | metadata server, storage devices, and client. | |||
skipping to change at page 6, line 14 | skipping to change at page 6, line 20 | |||
2. Coupling of Storage Devices | 2. Coupling of Storage Devices | |||
The coupling of the metadata server with the storage devices can be | The coupling of the metadata server with the storage devices can be | |||
either tight or loose. In a tight coupling, there is a control | either tight or loose. In a tight coupling, there is a control | |||
protocol present to manage security, LAYOUTCOMMITs, etc. With a | protocol present to manage security, LAYOUTCOMMITs, etc. With a | |||
loose coupling, the only control protocol might be a version of NFS. | loose coupling, the only control protocol might be a version of NFS. | |||
As such, semantics for managing security, state, and locking models | As such, semantics for managing security, state, and locking models | |||
MUST be defined. | MUST be defined. | |||
A file is split into metadata and data. The "metadata file" is that | ||||
part of the file stored on the metadata server. The "data file" is | ||||
that part of the file stored on the storage device. And the "file" | ||||
is the combination of the two. | ||||
2.1. LAYOUTCOMMIT | 2.1. LAYOUTCOMMIT | |||
With a tightly coupled system, when the metadata server receives a | With a tightly coupled system, when the metadata server receives a | |||
LAYOUTCOMMIT (see Section 18.42 of [RFC5661]), the semantics of the | LAYOUTCOMMIT (see Section 18.42 of [RFC5661]), the semantics of the | |||
File Layout Type MUST be met (see Section 12.5.4 of [RFC5661]). With | File Layout Type MUST be met (see Section 12.5.4 of [RFC5661]). With | |||
a loosely coupled system, a LAYOUTCOMMIT to the metadata server MUST | a loosely coupled system, a LAYOUTCOMMIT to the metadata server MUST | |||
be preceded by a COMMIT to the storage device. I.e., it is the | be preceded by a COMMIT to the storage device. It is the | |||
responsibility of the client to make sure the data file is stable | responsibility of the client to make sure the data file is stable | |||
before the metadata server begins to query the storage devices about | before the metadata server begins to query the storage devices about | |||
the changes to the file. Note that if the client has not done a | the changes to the file. Note that if the client has not done a | |||
COMMIT to the storage device, then the LAYOUTCOMMIT might not be | COMMIT to the storage device, then the LAYOUTCOMMIT might not be | |||
synchronized to the last WRITE operation to the storage device. | synchronized to the last WRITE operation to the storage device. | |||
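The ordering requirement for the loosely coupled case can be sketched as follows. This is a minimal illustration, not part of either draft: the classes `OpLog`, `StorageDevice`, and `MetadataServer` and the helper `flush_and_layoutcommit` are hypothetical names that only record the order in which operations are issued.

```python
from dataclasses import dataclass, field


@dataclass
class OpLog:
    """Records the order in which operations were issued (illustrative only)."""
    ops: list = field(default_factory=list)


class StorageDevice:
    """Hypothetical stand-in for a loosely coupled storage device."""

    def __init__(self, log):
        self.log = log
        self.unstable = 0  # bytes written but not yet committed

    def write(self, data):
        self.unstable += len(data)
        self.log.ops.append("WRITE")

    def commit(self):
        self.unstable = 0
        self.log.ops.append("COMMIT")


class MetadataServer:
    """Hypothetical stand-in for the metadata server."""

    def __init__(self, log):
        self.log = log

    def layoutcommit(self):
        self.log.ops.append("LAYOUTCOMMIT")


def flush_and_layoutcommit(ds, mds):
    """Loosely coupled case: make the data file stable, then LAYOUTCOMMIT."""
    if ds.unstable:
        ds.commit()
    mds.layoutcommit()


log = OpLog()
ds, mds = StorageDevice(log), MetadataServer(log)
ds.write(b"hello")
flush_and_layoutcommit(ds, mds)
print(log.ops)  # ['WRITE', 'COMMIT', 'LAYOUTCOMMIT']
```

Issuing the COMMIT first means that when the metadata server later queries the storage device, it sees the result of the last WRITE.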
2.2. Security Models | 2.2. Security Models | |||
With loosely coupled storage devices, the metadata server uses | With loosely coupled storage devices, the metadata server uses | |||
synthetic uids and gids for the data file, where the uid owner of the | synthetic uids and gids for the data file, where the uid owner of the | |||
data file is allowed read/write access and the gid owner is allowed | data file is allowed read/write access and the gid owner is allowed | |||
read only access. As part of the layout, the client is provided with | read only access. As part of the layout (see ffds_user and | |||
the rpc credentials to be used (see ffm_auth in Section 5.1) to | ffds_group in Section 5.1), the client is provided with the user and | |||
access the data file. Fencing off clients is achieved by using | group to be used in the Remote Procedure Call (RPC) [RFC5531] | |||
SETATTR by the server to change the uid and/or gid owners of the data | credentials needed to access the data file. Fencing off of clients | |||
file to implicitly revoke the outstanding rpc credentials. Note: it | is achieved by the metadata server changing the synthetic uid and/or | |||
is recommended to implement common access control methods at the | gid owners of the data file on the storage device to implicitly | |||
storage device filesystem exports level to allow only the metadata | revoke the outstanding RPC credentials. | |||
server root (super user) access to the storage device, and to set the | ||||
owner of all directories holding data files to the root user. This | With this loosely coupled model, the metadata server is not able to | |||
security method, when using weak auth flavors such as AUTH_SYS, | fence off a single client; it is forced to fence off all clients. | |||
However, as the other clients react to the fencing, returning their | ||||
layouts and trying to get new ones, the metadata server can hand out | ||||
a new uid and gid to allow access. | ||||
Note: it is recommended to implement common access control methods at | ||||
the storage device filesystem to allow only the metadata server root | ||||
(super user) access to the storage device, and to set the owner of | ||||
all directories holding data files to the root user. This approach | ||||
provides a practical model to enforce access control and fence off | provides a practical model to enforce access control and fence off | |||
cooperative clients, but it can not protect against malicious | cooperative clients, but it can not protect against malicious | |||
clients; hence it provides a level of security equivalent to NFSv3. | clients; hence it provides a level of security equivalent to | |||
AUTH_SYS. | ||||
With tightly coupled storage devices, the metadata server sets the | With tightly coupled storage devices, the metadata server sets the | |||
user and group owners, mode bits, and ACL of the data file to be the | user and group owners, mode bits, and ACL of the data file to be the | |||
same as the metadata file. And the client must authenticate with the | same as the metadata file. And the client must authenticate with the | |||
storage device and go through the same authorization process it would | storage device and go through the same authorization process it would | |||
go through via the metadata server. | go through via the metadata server. | |||
2.2.1. Implementation Notes for Synthetic uids/gids | ||||
The selection method for the synthetic uids and gids to be used for | ||||
fencing in loosely coupled storage devices is strictly an | ||||
implementation issue. An implementation might allow an administrator | ||||
to restrict a range of such ids in the name servers. She might also | ||||
be able to choose an id that would never be used to grant access. | ||||
Then when the metadata server had a request to access a file, a | ||||
SETATTR would be sent to the storage device to set the owner and | ||||
group of the data file. The user and group might be selected in a | ||||
round robin fashion from the range of available ids. | ||||
Those ids would be sent back as ffds_user and ffds_group to the | ||||
client. And it would present them as the RPC credentials to the | ||||
storage device. When the client was done accessing the file and the | ||||
metadata server knew that no other client was accessing the file, it | ||||
could reset the owner and group to restrict access to the data file. | ||||
When the metadata server wanted to fence off a client, it would | ||||
change the synthetic uid and/or gid to the restricted ids. Note that | ||||
using a restricted id ensures that there is a change of owner and at | ||||
least one id available that never gets allowed access. | ||||
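One way the selection scheme described in these implementation notes might look is sketched below. The draft leaves the method to the implementation, so everything here is an assumption: the class name `SyntheticIdAllocator`, the id range 1066 to 1069, and the restricted id 65000 are all hypothetical.

```python
from itertools import cycle


class SyntheticIdAllocator:
    """Hands out synthetic uid/gid pairs round robin; one id is never granted."""

    def __init__(self, first_id, last_id, restricted_id):
        if first_id <= restricted_id <= last_id:
            raise ValueError("restricted id must lie outside the grantable range")
        self.restricted_id = restricted_id
        self._ids = cycle(range(first_id, last_id + 1))

    def grant(self):
        """Return the (uid, gid) the metadata server would SETATTR on the data file."""
        return next(self._ids), next(self._ids)

    def fence(self):
        """Return the owner/group that revokes all outstanding credentials."""
        return self.restricted_id, self.restricted_id


alloc = SyntheticIdAllocator(first_id=1066, last_id=1069, restricted_id=65000)
first = alloc.grant()    # (1066, 1067)
second = alloc.grant()   # (1068, 1069)
third = alloc.grant()    # wraps around to (1066, 1067)
fenced = alloc.fence()   # (65000, 65000)
```

Because the round robin wraps, a small range eventually reuses a pair that an earlier, fenced client may still hold. A real implementation would presumably need a range large enough, or bookkeeping careful enough, to avoid that.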
2.2.2. Example of using Synthetic uids/gids | ||||
The user loghyr creates a file "ompha.c" on the metadata server and | ||||
it creates a corresponding data file on the storage device. | ||||
The metadata server entry may look like: | ||||
-rw-r--r-- 1 loghyr staff 1697 Dec 4 11:31 ompha.c | ||||
On the storage device, it may be assigned some random synthetic uid/ | ||||
gid to deny access: | ||||
-rw-r----- 1 19452 28418 1697 Dec 4 11:31 data_ompha.c | ||||
When the file is opened on a client, since the layout knows nothing | ||||
about the user (and does not care), whether loghyr or garbo opens the | ||||
file does not matter. The owner and group are modified and those | ||||
values are returned. | ||||
-rw-r----- 1 1066 1067 1697 Dec 4 11:31 data_ompha.c | ||||
The set of synthetic gids on the storage device should be selected | ||||
such that there is no mapping in any of the name services used by the | ||||
storage device. I.e., each group should have no members. | ||||
If the layout segment has an iomode of LAYOUTIOMODE4_READ, then the | ||||
metadata server should return a synthetic uid that is not set on the | ||||
storage device. Only the synthetic gid would be valid. | ||||
The client is thus solely responsible for enforcing file permissions | ||||
in a loosely coupled model. To allow loghyr write access, it will | ||||
send an RPC to the storage device with a credential of 1066:1067. To | ||||
allow garbo read access, it will send an RPC to the storage device | ||||
with a credential of 1067:1067. The value of the uid does not matter | ||||
as long as it is not the synthetic uid granted it when getting the | ||||
layout. | ||||
While pushing the enforcement of permission checking onto the client | ||||
may seem to weaken security, the client may already be responsible | ||||
for enforcing permissions before modifications are sent to a server. | ||||
With cached writes, the client is always responsible for tracking who | ||||
is modifying a file and making sure to not coalesce requests from | ||||
multiple users into one request. | ||||
2.3. State and Locking Models | 2.3. State and Locking Models | |||
Metadata file OPEN, LOCK, and DELEGATION operations are always | Metadata file OPEN, LOCK, and DELEGATION operations are always | |||
executed only against the metadata server. | executed only against the metadata server. | |||
With NFSv4 storage devices, the metadata server, in response to the | The metadata server responds to state changing operations by | |||
state changing operation, executes them against the respective data | executing them against the respective data files on the storage | |||
files on the storage devices. It then sends the storage device open | devices. It then sends the storage device open stateid as part of | |||
stateid as part of the layout (see the ffm_stateid in Section 5.1) | the layout (see the ffm_stateid in Section 5.1) and it is then used | |||
and it is then used by the client for executing READ/WRITE operations | by the client for executing READ/WRITE operations against the storage | |||
against the storage device. | device. | |||
Standalone NFSv4.1 storage devices that do not return the | Standalone NFSv4.1+ storage devices that do not return the | |||
EXCHGID4_FLAG_USE_PNFS_DS flag to EXCHANGE_ID are used the same way | EXCHGID4_FLAG_USE_PNFS_DS flag to EXCHANGE_ID are used the same way | |||
as NFSv4 storage devices. | as NFSv4 storage devices. | |||
NFSv4.1 clustered storage devices that do identify themselves with | NFSv4.1+ clustered storage devices that do identify themselves with | |||
the EXCHGID4_FLAG_USE_PNFS_DS flag to EXCHANGE_ID use a back-end | the EXCHGID4_FLAG_USE_PNFS_DS flag to EXCHANGE_ID use a back-end | |||
control protocol as described in [RFC5661] to implement a global | control protocol as described in [RFC5661] to implement a global | |||
stateid model as defined there. | stateid model as defined there. | |||
3. XDR Description of the Flexible File Layout Type | 3. XDR Description of the Flexible File Layout Type | |||
This document contains the external data representation (XDR) | This document contains the external data representation (XDR) | |||
[RFC4506] description of the Flexible File Layout Type. The XDR | [RFC4506] description of the Flexible File Layout Type. The XDR | |||
description is embedded in this document in a way that makes it | description is embedded in this document in a way that makes it | |||
simple for the reader to extract into a ready-to-compile form. The | simple for the reader to extract into a ready-to-compile form. The | |||
skipping to change at page 9, line 36 | skipping to change at page 11, line 21 | |||
/// * %#include <nfsv42.x> | /// * %#include <nfsv42.x> | |||
/// * %#include <rpc_prot.x> | /// * %#include <rpc_prot.x> | |||
/// */ | /// */ | |||
/// | /// | |||
<CODE ENDS> | <CODE ENDS> | |||
4. Device Addressing and Discovery | 4. Device Addressing and Discovery | |||
Data operations to a storage device require the client to know the | Data operations to a storage device require the client to know the | |||
network address of the storage device. The NFSv4.1 GETDEVICEINFO | network address of the storage device. The NFSv4.1+ GETDEVICEINFO | |||
operation (Section 18.40 of [RFC5661]) is used by the client to | operation (Section 18.40 of [RFC5661]) is used by the client to | |||
retrieve that information. | retrieve that information. | |||
4.1. ff_device_addr4 | 4.1. ff_device_addr4 | |||
The ff_device_addr4 data structure is returned by the server as the | The ff_device_addr4 data structure is returned by the server as the | |||
storage protocol specific opaque field da_addr_body in the | storage protocol specific opaque field da_addr_body in the | |||
device_addr4 structure by a successful GETDEVICEINFO operation. | device_addr4 structure by a successful GETDEVICEINFO operation. | |||
<CODE BEGINS> | <CODE BEGINS> | |||
skipping to change at page 12, line 16 | skipping to change at page 14, line 4 | |||
major ID of the server owner. It is not always necessary for the two | major ID of the server owner. It is not always necessary for the two | |||
storage device addresses to designate the same storage device with | storage device addresses to designate the same storage device with | |||
trunking being used. For example, the data could be read-only, and | trunking being used. For example, the data could be read-only, and | |||
the data consist of exact replicas. | the data consist of exact replicas. | |||
5. Flexible File Layout Type | 5. Flexible File Layout Type | |||
The layout4 type is defined in [RFC5662] as follows: | The layout4 type is defined in [RFC5662] as follows: | |||
<CODE BEGINS> | <CODE BEGINS> | |||
enum layouttype4 { | enum layouttype4 { | |||
LAYOUT4_NFSV4_1_FILES = 1, | LAYOUT4_NFSV4_1_FILES = 1, | |||
LAYOUT4_OSD2_OBJECTS = 2, | LAYOUT4_OSD2_OBJECTS = 2, | |||
LAYOUT4_BLOCK_VOLUME = 3, | LAYOUT4_BLOCK_VOLUME = 3, | |||
LAYOUT4_FLEX_FILES = 0x80000005 | LAYOUT4_FLEX_FILES = 4 | |||
[[RFC Editor: please modify the LAYOUT4_FLEX_FILES | [[RFC Editor: please modify the LAYOUT4_FLEX_FILES | |||
to be the layouttype assigned by IANA]] | to be the layouttype assigned by IANA]] | |||
}; | }; | |||
struct layout_content4 { | struct layout_content4 { | |||
layouttype4 loc_type; | layouttype4 loc_type; | |||
opaque loc_body<>; | opaque loc_body<>; | |||
}; | }; | |||
struct layout4 { | struct layout4 { | |||
skipping to change at page 13, line 4 | skipping to change at page 14, line 37 | |||
This document defines structure associated with the layouttype4 value | This document defines structure associated with the layouttype4 value | |||
LAYOUT4_FLEX_FILES. [RFC5661] specifies the loc_body structure as an | LAYOUT4_FLEX_FILES. [RFC5661] specifies the loc_body structure as an | |||
XDR type "opaque". The opaque layout is uninterpreted by the generic | XDR type "opaque". The opaque layout is uninterpreted by the generic | |||
pNFS client layers, but obviously must be interpreted by the Flexible | pNFS client layers, but obviously must be interpreted by the Flexible | |||
File Layout Type implementation. This section defines the structure | File Layout Type implementation. This section defines the structure | |||
of this opaque value, ff_layout4. | of this opaque value, ff_layout4. | |||
5.1. ff_layout4 | 5.1. ff_layout4 | |||
<CODE BEGINS> | <CODE BEGINS> | |||
/// struct ff_data_server4 { | /// struct ff_data_server4 { | |||
/// deviceid4 ffds_deviceid; | /// deviceid4 ffds_deviceid; | |||
/// uint32_t ffds_efficiency; | /// uint32_t ffds_efficiency; | |||
/// stateid4 ffds_stateid; | /// stateid4 ffds_stateid; | |||
/// nfs_fh4 ffds_fh_vers<>; | /// nfs_fh4 ffds_fh_vers<>; | |||
/// opaque_auth ffds_auth; | /// fattr4_owner ffds_user; | |||
/// fattr4_owner_group ffds_group; | ||||
/// }; | /// }; | |||
/// | /// | |||
/// struct ff_mirror4 { | /// struct ff_mirror4 { | |||
/// ff_data_server4 ffm_data_servers<>; | /// ff_data_server4 ffm_data_servers<>; | |||
/// }; | /// }; | |||
/// | /// | |||
/// struct ff_layout4 { | /// struct ff_layout4 { | |||
/// length4 ffl_stripe_unit; | /// length4 ffl_stripe_unit; | |||
/// ff_mirror4 ffl_mirrors<>; | /// ff_mirror4 ffl_mirrors<>; | |||
/// }; | /// }; | |||
/// | /// | |||
<CODE ENDS> | <CODE ENDS> | |||
The ff_layout4 structure specifies a layout over a set of mirrored | The ff_layout4 structure specifies a layout over a set of mirrored | |||
copies of the data file. This mirroring protects against loss of | copies of that portion of the data file described in the current | |||
data files. | layout segment. This mirroring protects against loss of data in | |||
layout segments. Note that while not explicitly shown in the above | ||||
XDR, each layout4 element returned in the logr_layout array of | ||||
LAYOUTGET4res (see Section 18.43.1 of [RFC5661]) describes a layout | ||||
segment. Hence each ff_layout4 also describes a layout segment. | ||||
It is possible that the file is concatenated from more than one | It is possible that the file is concatenated from more than one | |||
layout segment. Each layout segment MAY represent different striping | layout segment. Each layout segment MAY represent different striping | |||
parameters, applying respectively only to the layout segment byte | parameters, applying respectively only to the layout segment byte | |||
range. | range. | |||
The ffl_stripe_unit field is the stripe unit size in use for the | The ffl_stripe_unit field is the stripe unit size in use for the | |||
current layout segment. The number of stripes is given inside each | current layout segment. The number of stripes is given inside each | |||
mirror by the number of elements in ffm_data_servers. If the number | mirror by the number of elements in ffm_data_servers. If the number | |||
of stripes is one, then the value for ffl_stripe_unit MUST default to | of stripes is one, then the value for ffl_stripe_unit MUST default to | |||
skipping to change at page 14, line 29 | skipping to change at page 16, line 29 | |||
+-----------+ +-----------+ | +-----------+ +-----------+ | |||
|+-----------+ |+-----------+ | |+-----------+ |+-----------+ | |||
||+-----------+ ||+-----------+ | ||+-----------+ ||+-----------+ | |||
+|| Storage | +|| Storage | | +|| Storage | +|| Storage | | |||
+| Devices | +| Devices | | +| Devices | +| Devices | | |||
+-----------+ +-----------+ | +-----------+ +-----------+ | |||
Figure 1 | Figure 1 | |||
The ffl_mirrors field represents an array of state information for | The ffl_mirrors field represents an array of state information for | |||
each mirrored copy of the file. Each element is described by a | each mirrored copy of the current layout segment. Each element is | |||
ff_mirror4 type. | described by a ff_mirror4 type. | |||
ffds_deviceid provides the deviceid of the storage device holding the | ffds_deviceid provides the deviceid of the storage device holding the | |||
data file. | data file. | |||
ffds_fh_vers is an array of filehandles of the data file matching to | ffds_fh_vers is an array of filehandles of the data file matching to | |||
the available NFS versions on the given storage device. There MUST | the available NFS versions on the given storage device. There MUST | |||
be exactly as many elements in ffds_fh_vers as there are in | be exactly as many elements in ffds_fh_vers as there are in | |||
ffda_versions. Each element of the array corresponds to each | ffda_versions. Each element of the array corresponds to each | |||
ffdv_version and ffdv_minorversion provided for the device. The | ffdv_version and ffdv_minorversion provided for the device. The | |||
array allows for server implementations which have different | array allows for server implementations which have different | |||
skipping to change at page 15, line 5 | skipping to change at page 17, line 5 | |||
See Section 5.3 for how to handle versioning issues between the | See Section 5.3 for how to handle versioning issues between the | |||
client and storage devices. | client and storage devices. | |||
For tight coupling, ffds_stateid provides the stateid to be used by | For tight coupling, ffds_stateid provides the stateid to be used by | |||
the client to access the file. For loose coupling and a NFSv4 | the client to access the file. For loose coupling and a NFSv4 | |||
storage device, the client may use an anonymous stateid to perform I/ | storage device, the client may use an anonymous stateid to perform I/ | |||
O on the storage device as there is no use for the metadata server | O on the storage device as there is no use for the metadata server | |||
stateid (no control protocol). In such a scenario, the server MUST | stateid (no control protocol). In such a scenario, the server MUST | |||
set the ffds_stateid to be zero. | set the ffds_stateid to be zero. | |||
For loosely coupled storage devices, ffds_auth provides the RPC | For loosely coupled storage devices, ffds_user and ffds_group provide | |||
credentials to be used by the client to access the data files. For | the synthetic user and group to be used in the RPC credentials that | |||
tightly coupled storage devices, the server SHOULD use the AUTH_NONE | the client presents to the storage device to access the data files. | |||
flavor and a zero length opaque body to minimize the returned | For tightly coupled storage devices, the user and group on the | |||
structure length. I.e., if ffdv_tightly_coupled (see Section 4.1) is | storage device will be the same as on the metadata server. I.e., if | |||
set, then the client MUST ignore ffds_auth in this case. | ffdv_tightly_coupled (see Section 4.1) is set, then the client MUST | |||
ignore both ffds_user and ffds_group. | ||||
The allowed values for both ffds_user and ffds_group are specified in | ||||
Section 5.9 of [RFC5661]. For NFSv3 compatibility, user and group | ||||
strings that consist of decimal numeric values with no leading zeros | ||||
can be given a special interpretation by clients and servers that | ||||
choose to provide such support. The receiver may treat such a user | ||||
or group string as representing the same user as would be represented | ||||
by an NFSv3 uid or gid having the corresponding numeric value. Note | ||||
that if using Kerberos for security, the expectation is that these | ||||
values will be a name@domain string. | ||||
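The special interpretation of numeric strings can be illustrated with a small sketch. The function name numeric_id is hypothetical, and accepting the single character "0" as a valid numeric value is an assumption of this sketch rather than something stated here.

```python
def numeric_id(owner):
    """Return the NFSv3-style uid/gid for a user or group string that
    consists only of decimal digits with no leading zeros, else None.

    Assumption: "0" itself is accepted as the numeric value 0.
    """
    if not (owner.isascii() and owner.isdigit()):
        return None                      # e.g. a name@domain string
    if len(owner) > 1 and owner[0] == "0":
        return None                      # leading zeros are not allowed
    return int(owner)


uid = numeric_id("1066")                 # synthetic uid from the example
not_numeric = numeric_id("loghyr@example.com")
```

A receiver that does not provide this support would simply treat every string as a name@domain value.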
ffds_efficiency describes the metadata server's evaluation as to the | ffds_efficiency describes the metadata server's evaluation as to the | |||
effectiveness of each mirror. Note that this is per layout and not | effectiveness of each mirror. Note that this is per layout and not | |||
per device as the metric may change due to perceived load, | per device as the metric may change due to perceived load, | |||
availability to the metadata server, etc. Higher values denote | availability to the metadata server, etc. Higher values denote | |||
higher perceived utility. The way the client can select the best | higher perceived utility. The way the client can select the best | |||
mirror to access is discussed in Section 8.1. | mirror to access is discussed in Section 8.1. | |||
5.2. Interactions Between Devices and Layouts | 5.2. Interactions Between Devices and Layouts | |||
skipping to change at page 15, line 32 | skipping to change at page 17, line 43 | |||
relationship between multipathing and filehandles can result in | relationship between multipathing and filehandles can result in | |||
either 0, 1, or N filehandles (see Section 13.3). Some rationales for | either 0, 1, or N filehandles (see Section 13.3). Some rationales for | |||
this are clustered servers which share the same filehandle or | this are clustered servers which share the same filehandle or | |||
allowing for multiple read-only copies of the file on the same | allowing for multiple read-only copies of the file on the same | |||
storage device. In the Flexible File Layout Type, while there is an | storage device. In the Flexible File Layout Type, while there is an | |||
array of filehandles, they are independent of the multipathing being | array of filehandles, they are independent of the multipathing being | |||
used. If the metadata server wants to provide multiple read-only | used. If the metadata server wants to provide multiple read-only | |||
copies of the same file on the same storage device, then it should | copies of the same file on the same storage device, then it should | |||
provide multiple ff_device_addr4, each as a mirror. The client can | provide multiple ff_device_addr4, each as a mirror. The client can | |||
then determine that since the ffds_fh_vers are different, then there | then determine that since the ffds_fh_vers are different, then there | |||
multiple copies of the file available. | are multiple copies of the file for the current layout segment | |||
available. | ||||
5.3. Handling Version Errors | 5.3. Handling Version Errors | |||
When the metadata server provides the ffda_versions array in the | When the metadata server provides the ffda_versions array in the | |||
ff_device_addr4 (see Section 4.1), the client is able to determine if | ff_device_addr4 (see Section 4.1), the client is able to determine if | |||
it can not access a storage device with any of the supplied | it can not access a storage device with any of the supplied | |||
ffdv_version and ffdv_minorversion combinations. However, due to the | ffdv_version and ffdv_minorversion combinations. However, due to the | |||
limitations of reporting errors in GETDEVICEINFO (see Section 18.40 | limitations of reporting errors in GETDEVICEINFO (see Section 18.40 | |||
in [RFC5661]), the client is not able to specify which specific device | in [RFC5661]), the client is not able to specify which specific device | |||
it can not communicate with over one of the provided ffdv_version and | it can not communicate with over one of the provided ffdv_version and | |||
skipping to change at page 16, line 12 | skipping to change at page 18, line 24 | |||
minor version (e.g., client can use NFSv4.1 but not NFSv4.2), the | minor version (e.g., client can use NFSv4.1 but not NFSv4.2), the | |||
error indicates that for all the supplied combinations for | error indicates that for all the supplied combinations for | |||
ffdv_version and ffdv_minorversion, the client can not communicate | ffdv_version and ffdv_minorversion, the client can not communicate | |||
with the storage device. The client can retry the GETDEVICEINFO to | with the storage device. The client can retry the GETDEVICEINFO to | |||
see if the metadata server can provide a different combination or it | see if the metadata server can provide a different combination or it | |||
can fall back to doing the I/O through the metadata server. | can fall back to doing the I/O through the metadata server. | |||
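A client-side check of the offered versions could look like the sketch below. It assumes that each ffda_versions element is reduced to an (ffdv_version, ffdv_minorversion) tuple and that the array order expresses preference. The function name pick_version and the ordering assumption are hypothetical.

```python
def pick_version(offered, supported):
    """Return the first offered (version, minorversion) pair that the
    client supports, or None if no combination is usable.

    A None result corresponds to the case above where the client must
    retry GETDEVICEINFO or fall back to I/O through the metadata server.
    """
    for combo in offered:
        if combo in supported:
            return combo
    return None


client_supports = {(4, 1), (3, 0)}
chosen = pick_version([(4, 2), (4, 1), (3, 0)], client_supports)
unusable = pick_version([(4, 2)], client_supports)
```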
6. Striping via Sparse Mapping | 6. Striping via Sparse Mapping | |||
While other Layout Types support both dense and sparse mapping of | While other Layout Types support both dense and sparse mapping of | |||
logical offsets to phyisical offsets within a file (see for example | logical offsets to physical offsets within a file (see for example | |||
Section 13.4 of [RFC5661]), the Flexible File Layout Type only | Section 13.4 of [RFC5661]), the Flexible File Layout Type only | |||
supports a sparse mapping. | supports a sparse mapping. | |||
With sparse mappings, the logical offset within a file (L) is also | With sparse mappings, the logical offset within a file (L) is also | |||
the physical offset on the storage device. As detailed in | the physical offset on the storage device. As detailed in | |||
Section 13.4.4 of [RFC5661], this results in holes across each | Section 13.4.4 of [RFC5661], this results in holes across each | |||
storage device which does not contain the current stripe index. | storage device which does not contain the current stripe index. | |||
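As a rough illustration of a sparse mapping, the sketch below computes which element of ffm_data_servers would serve a given logical offset. The round robin assignment of stripe units starting at index zero is an assumption borrowed from the sparse packing description in Section 13.4.4 of [RFC5661]. The function name sparse_map is hypothetical.

```python
def sparse_map(logical_offset, stripe_unit, stripe_count):
    """Map a logical file offset to (stripe index, physical offset).

    Assumptions: stripe units are assigned round robin, and the first
    stripe unit lands on index 0. With a sparse mapping the physical
    offset on the storage device equals the logical offset.
    """
    if stripe_unit <= 0 or stripe_count <= 0:
        raise ValueError("stripe unit and stripe count must be positive")
    stripe_index = (logical_offset // stripe_unit) % stripe_count
    return stripe_index, logical_offset


# Three stripes with a 4096 byte stripe unit.
first_unit = sparse_map(0, 4096, 3)
second_unit = sparse_map(4096, 4096, 3)
wrapped = sparse_map(12288, 4096, 3)
```

Each storage device only ever sees offsets belonging to its own stripe index, so the ranges that belong to the other indices remain holes on that device.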
L: logical offset into the file | L: logical offset into the file | |||
skipping to change at page 17, line 19 | skipping to change at page 19, line 29 | |||
LAYOUTGET and retry the I/O operation(s) using the new layout, or the | LAYOUTGET and retry the I/O operation(s) using the new layout, or the | |||
client MAY just retry the I/O operation(s) using regular NFS READ or | client MAY just retry the I/O operation(s) using regular NFS READ or | |||
WRITE operations via the metadata server. The client SHOULD attempt | WRITE operations via the metadata server. The client SHOULD attempt | |||
to retrieve a new layout and retry the I/O operation using the | to retrieve a new layout and retry the I/O operation using the | |||
storage device first and only if the error persists, retry the I/O | storage device first and only if the error persists, retry the I/O | |||
operation via the metadata server. | operation via the metadata server. | |||
8. Mirroring | 8. Mirroring | |||
The Flexible File Layout Type has a simple model in place for the | The Flexible File Layout Type has a simple model in place for the | |||
mirroring of files. There is no assumption that each copy of the | mirroring of the file data constrained by a layout segment. There is | |||
mirror is stored identically on the storage devices, i.e., one device | no assumption that each copy of the mirror is stored identically on | |||
might employ compression or deduplication on the file. However, the | the storage devices, i.e., one device might employ compression or | |||
over the wire transfer of the file contents MUST appear identical. | deduplication on the data. However, the over the wire transfer of | |||
Note, this is a construct of the selected XDR representation that | the file contents MUST appear identical. Note, this is a construct | |||
each mirrored copy of the file has the same striping pattern (see | of the selected XDR representation that each mirrored copy of the | |||
Figure 1). | layout segment has the same striping pattern (see Figure 1). | |||
The metadata server is responsible for determining the number of | The metadata server is responsible for determining the number of | |||
mirrored copies and the location of each mirror. While the client | mirrored copies and the location of each mirror. While the client | |||
may provide a hint to how many copies it wants (see Section 12), the | may provide a hint to how many copies it wants (see Section 12), the | |||
metadata server can ignore that hint and in any event, the client has | metadata server can ignore that hint and in any event, the client has | |||
no means to dictate either the storage device (which also means the | no means to dictate either the storage device (which also means the | |||
coupling and/or protocol levels to access the file) or the location | coupling and/or protocol levels to access the layout segments) or | |||
of said storage device. | the location of said storage device. | |||
The updating of mirrored files is done via client-side mirroring. | The updating of mirrored layout segments is done via client-side | |||
With this approach, the client is responsible for making sure | mirroring. With this approach, the client is responsible for making | |||
modifications get to all copies of the file it is informed of via the | sure modifications get to all copies of the layout segments it is | |||
layout. If a file is being resilvered to a storage device, that | informed of via the layout. If a layout segment is being resilvered | |||
mirrored copy will not be in the layout. Thus the metadata server | to a storage device, that mirrored copy will not be in the layout. | |||
MUST update that copy until the client is presented it in a layout. | Thus the metadata server MUST update that copy until the client is | |||
Also, if the client is writing to the file via the metadata server, | presented it in a layout. Also, if the client is writing to the | |||
e.g., using an earlier version of the protocol, then the metadata | layout segments via the metadata server, e.g., using an earlier | |||
server MUST update all copies of the mirror. As seen in Section 8.3, | version of the protocol, then the metadata server MUST update all | |||
during the resilvering, the layout is recalled, and the client has to | copies of the mirror. As seen in Section 8.3, during the | |||
make modifications via the metadata server. | resilvering, the layout is recalled, and the client has to make | |||
modifications via the metadata server. | ||||
8.1. Selecting a Mirror | 8.1. Selecting a Mirror | |||
When the metadata server grants a layout to a client, it can let the | When the metadata server grants a layout to a client, it can let the | |||
client know how fast it expects each mirror to be once the request | client know how fast it expects each mirror to be once the request | |||
arrives at the storage devices via the ffds_efficiency member. While | arrives at the storage devices via the ffds_efficiency member. While | |||
the algorithms to calculate that value are left to the metadata | the algorithms to calculate that value are left to the metadata | |||
server implementations, factors that could contribute to that | server implementations, factors that could contribute to that | |||
calculation include speed of the storage device, physical memory | calculation include speed of the storage device, physical memory | |||
available to the device, operating system version, current load, etc. | available to the device, operating system version, current load, etc. | |||
However, what should not be involved in that calculation is a | However, what should not be involved in that calculation is a | |||
perceived network distance between the client and the storage device. | perceived network distance between the client and the storage device. | |||
The client is better situated for making that determination based on | The client is better situated for making that determination based on | |||
past interaction with the storage device over the different available | past interaction with the storage device over the different available | |||
network interfaces between the two. I.e., the metadata server might | network interfaces between the two. I.e., the metadata server might | |||
not know about a transient outage between the client and storage | not know about a transient outage between the client and storage | |||
device because it has no presence on the given subnet. | device because it has no presence on the given subnet. | |||
As such, it is the client which decides which mirror to access for | As such, it is the client which decides which mirror to access for | |||
reading the file. The requirements for writing to a mirrored file | reading the file. The requirements for writing to a mirrored layout | |||
are presented below. | segment are presented below. | |||
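One possible client policy for choosing a mirror to read from is sketched below. It combines the metadata server's ffds_efficiency with the client's own observations. The MirrorView type, its reachable and rtt_ms fields, and the tie-breaking rule are all assumptions of this sketch. The document leaves the selection algorithm to the client.

```python
from dataclasses import dataclass


@dataclass
class MirrorView:
    """Hypothetical client-side view of one mirror."""
    name: str
    efficiency: int     # ffds_efficiency as reported in the layout
    reachable: bool     # client observation, unknown to the metadata server
    rtt_ms: float       # client-measured round trip time


def select_read_mirror(mirrors):
    """Pick a reachable mirror with the highest efficiency, breaking
    ties by the lowest client-measured round trip time."""
    candidates = [m for m in mirrors if m.reachable]
    if not candidates:
        return None     # caller may fall back to the metadata server
    return max(candidates, key=lambda m: (m.efficiency, -m.rtt_ms))


mirrors = [
    MirrorView("ds-a", 10, False, 1.0),   # best rated, but unreachable
    MirrorView("ds-b", 5, True, 2.0),
    MirrorView("ds-c", 5, True, 0.5),
]
best = select_read_mirror(mirrors)
```

Here the mirror rated highest by the metadata server is skipped because of a transient outage that only the client can see.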
8.2. Writing to Mirrors | 8.2. Writing to Mirrors | |||
The client is responsible for updating all mirrored copies of the | The client is responsible for updating all mirrored copies of the | |||
file that it is given in the layout. If all but one copy is updated | layout segments that it is given in the layout. If all but one copy | |||
successfully and the last one provides an error, then the client | is updated successfully and the last one provides an error, then the | |||
needs to return the layout to the metadata server with an error | client needs to return the layout to the metadata server with an | |||
indicating that the update failed to that storage device. | error indicating that the update failed to that storage device. | |||
The metadata server is then responsible for determining if it wants | The metadata server is then responsible for determining if it wants | |||
to remove the errant mirror from the layout, if the mirror has | to remove the errant mirror from the layout, if the mirror has | |||
recovered from some transient error, etc. When the client tries to | recovered from some transient error, etc. When the client tries to | |||
get a new layout, the metadata server informs it of the decision by | get a new layout, the metadata server informs it of the decision by | |||
the contents of the layout. The client MUST NOT make any assumptions | the contents of the layout. The client MUST NOT make any assumptions | |||
that the contents of the previous layout will match those of the new | that the contents of the previous layout will match those of the new | |||
one. If it has updates that were not committed, it MUST resend those | one. If it has updates that were not committed, it MUST resend those | |||
updates to all mirrors. | updates to all mirrors. | |||
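The client-side write rule can be sketched as a loop over the mirrors in the layout. The function write_to_all_mirrors, the write_fn callback, and the use of OSError to signal a storage device failure are hypothetical choices for this sketch. The actual WRITE and LAYOUTRETURN operations are not modeled.

```python
def write_to_all_mirrors(mirrors, write_fn, offset, data):
    """Send the same update to every mirror named in the layout.

    Returns a list of (mirror, error) pairs for the mirrors that
    failed. A non-empty result means the client needs to return the
    layout to the metadata server with the error information.
    """
    failed = []
    for mirror in mirrors:
        try:
            write_fn(mirror, offset, data)
        except OSError as err:
            failed.append((mirror, err))   # keep going: other mirrors still need the update
    return failed


written = []


def fake_write(mirror, offset, data):
    """Stand-in for a WRITE to a storage device; "ds2" always fails."""
    if mirror == "ds2":
        raise OSError("simulated I/O error")
    written.append(mirror)


failures = write_to_all_mirrors(["ds1", "ds2", "ds3"], fake_write, 0, b"abc")
```

The loop does not stop at the first error, so every mirror that can accept the update receives it before the failure is reported.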
8.3. Metadata Server Resilvering of the File | 8.3. Metadata Server Resilvering of the File | |||
The metadata server may elect to create a new mirror of the file at | The metadata server may elect to create a new mirror of the layout | |||
any time. This might be to resilver a copy on a storage device which | segments at any time. This might be to resilver a copy on a storage | |||
was down for servicing, to provide a copy of the file on storage with | device which was down for servicing, to provide a copy of the layout | |||
different storage performance characteristics, etc. As the client | segments on storage with different storage performance | |||
will not be aware of the new mirror and the metadata server will not | characteristics, etc. As the client will not be aware of the new | |||
be aware of updates that the client is making to the file, the | mirror and the metadata server will not be aware of updates that the | |||
metadata server MUST recall the writable layout segment(s) that it is | client is making to the layout segments, the metadata server MUST | |||
resilvering. If the client issues a LAYOUTGET for a writable layout | recall the writable layout segment(s) that it is resilvering. If the | |||
segment which is in the process of being resilvered, then the | client issues a LAYOUTGET for a writable layout segment which is in | |||
metadata server MUST deny that request with a NFS4ERR_LAYOUTTRYLATER. | the process of being resilvered, then the metadata server MUST deny | |||
The client can then perform the I/O through the metadata server. | that request with a NFS4ERR_LAYOUTTRYLATER. The client can then | |||
perform the I/O through the metadata server. | ||||
9. Flexible Files Layout Type Return | 9. Flexible Files Layout Type Return | |||
layoutreturn_file4 is used in the LAYOUTRETURN operation to convey | layoutreturn_file4 is used in the LAYOUTRETURN operation to convey | |||
layout-type specific information to the server. It is defined in | layout-type specific information to the server. It is defined in | |||
[RFC5661] as follows: | [RFC5661] as follows: | |||
<CODE BEGINS> | <CODE BEGINS> | |||
struct layoutreturn_file4 { | struct layoutreturn_file4 { | |||
skipping to change at page 20, line 4 | skipping to change at page 22, line 15 | |||
lrf_body opaque value is defined by ff_layoutreturn4 (See | lrf_body opaque value is defined by ff_layoutreturn4 (See | |||
Section 9.3). It allows the client to report I/O error information | Section 9.3). It allows the client to report I/O error information | |||
or layout usage statistics back to the metadata server as defined | or layout usage statistics back to the metadata server as defined | |||
below. | below. | |||
9.1. I/O Error Reporting | 9.1. I/O Error Reporting | |||
9.1.1. ff_ioerr4 | 9.1.1. ff_ioerr4 | |||
<CODE BEGINS> | <CODE BEGINS> | |||
/// struct ff_ioerr4 { | /// struct ff_ioerr4 { | |||
/// offset4 ffie_offset; | /// offset4 ffie_offset; | |||
/// length4 ffie_length; | /// length4 ffie_length; | |||
/// stateid4 ffie_stateid; | /// stateid4 ffie_stateid; | |||
/// device_error4 ffie_errors; | /// device_error4 ffie_errors<>; | |||
/// }; | /// }; | |||
/// | /// | |||
<CODE ENDS> | <CODE ENDS> | |||
Recall that [NFSv42] defines device_error4 as: | Recall that [NFSv42] defines device_error4 as: | |||
<CODE BEGINS> | <CODE BEGINS> | |||
struct device_error4 { | struct device_error4 { | |||
skipping to change at page 21, line 28 | skipping to change at page 23, line 35 | |||
9.2.2. ff_layoutupdate4 | 9.2.2. ff_layoutupdate4 | |||
<CODE BEGINS> | <CODE BEGINS> | |||
/// struct ff_layoutupdate4 { | /// struct ff_layoutupdate4 { | |||
/// netaddr4 ffl_addr; | /// netaddr4 ffl_addr; | |||
/// nfs_fh4 ffl_fhandle; | /// nfs_fh4 ffl_fhandle; | |||
/// ff_io_latency4 ffl_read; | /// ff_io_latency4 ffl_read; | |||
/// ff_io_latency4 ffl_write; | /// ff_io_latency4 ffl_write; | |||
/// uint32_t ffl_queue_depth; | ||||
/// nfstime4 ffl_duration; | /// nfstime4 ffl_duration; | |||
/// bool ffl_local; | /// bool ffl_local; | |||
/// }; | /// }; | |||
/// | /// | |||
<CODE ENDS> | <CODE ENDS> | |||
ffl_addr differentiates which network address the client connected to | ffl_addr differentiates which network address the client connected to | |||
on the storage device. In the case of multipathing, ffl_fhandle | on the storage device. In the case of multipathing, ffl_fhandle | |||
indicates which read-only copy was selected. ffl_read and ffl_write | indicates which read-only copy was selected. ffl_read and ffl_write | |||
convey the latencies respectively for both read and write operations. | convey the latencies respectively for both read and write operations. | |||
ffl_queue_depth can be used to indicate how long the I/O had to wait | ffl_duration is used to indicate the time period over which the | |||
on internal queues before being serviced. ffl_duration is used to | statistics were collected. ffl_local if true indicates that the I/O | |||
indicate the time period over which the statistics were collected. | was serviced by the client's cache. This flag allows the client to | |||
ffl_local if true indicates that the I/O was serviced by the client's | inform the metadata server about "hot" access to a file it would not | |||
cache. This flag allows the client to inform the metadata server | normally be allowed to report on. | |||
about "hot" access to a file it would not normally be allowed to | ||||
report on. | ||||
9.2.3. ff_iostats4 | 9.2.3. ff_iostats4 | |||
<CODE BEGINS> | <CODE BEGINS> | |||
/// struct ff_iostats4 { | /// struct ff_iostats4 { | |||
/// offset4 ffis_offset; | /// offset4 ffis_offset; | |||
/// length4 ffis_length; | /// length4 ffis_length; | |||
/// stateid4 ffis_stateid; | /// stateid4 ffis_stateid; | |||
/// io_info4 ffis_read; | /// io_info4 ffis_read; | |||
/// io_info4 ffis_write; | /// io_info4 ffis_write; | |||
/// deviceid4 ffis_deviceid; | /// deviceid4 ffis_deviceid; | |||
/// layoutupdate4 ffis_layoutupdate; | /// ff_layoutupdate4 ffis_layoutupdate; | |||
/// }; | /// }; | |||
/// | /// | |||
<CODE ENDS> | <CODE ENDS> | |||
Recall that [NFSv42] defines io_info4 as: | Recall that [NFSv42] defines io_info4 as: | |||
<CODE BEGINS> | <CODE BEGINS> | |||
struct io_info4 { | struct io_info4 { | |||
skipping to change at page 23, line 8 | skipping to change at page 25, line 8 | |||
example, a client can define the default byte range resolution to be | example, a client can define the default byte range resolution to be | |||
1 MB in size and the thresholds for reporting to be 1 MB/second or 10 | 1 MB in size and the thresholds for reporting to be 1 MB/second or 10 | |||
I/O operations per second. For each byte range, ffis_offset and | I/O operations per second. For each byte range, ffis_offset and | |||
ffis_length represent the starting offset of the range and the range | ffis_length represent the starting offset of the range and the range | |||
length in bytes. ffis_read.ii_count, ffis_read.ii_bytes, | length in bytes. ffis_read.ii_count, ffis_read.ii_bytes, | |||
ffis_write.ii_count, and ffis_write.ii_bytes represent, respectively, | ffis_write.ii_count, and ffis_write.ii_bytes represent, respectively, | |||
the number of contiguous read and write I/Os and the respective | the number of contiguous read and write I/Os and the respective | |||
aggregate number of bytes transferred within the reported byte range. | aggregate number of bytes transferred within the reported byte range. | |||
The combination of ffis_deviceid and ffl_addr uniquely identifies both | The combination of ffis_deviceid and ffl_addr uniquely identifies both | |||
the storage path and the network route to it. Additionally, the | the storage path and the network route to it. Finally, the | |||
ffis_deviceid informs the metadata server as to the version and/or | ||||
minor version being used for I/O to the storage device. Finally, the | ||||
ffl_fhandle allows the metadata server to differentiate between | ffl_fhandle allows the metadata server to differentiate between | |||
multiple read-only copies of the file on the same storage device. | multiple read-only copies of the file on the same storage device. | |||
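The per-byte-range accounting described above might be kept on the client roughly as follows. This is a sketch under stated assumptions, not the draft's method: RANGE_SIZE, IoInfo, RangeStats, and record_io are hypothetical names, the 1 MB resolution is taken from the example in the text, and counting one I/O in every range that an operation touches is an assumption about how ii_count is maintained.

```python
from dataclasses import dataclass

RANGE_SIZE = 1 << 20  # assumed 1 MB byte range resolution


@dataclass
class IoInfo:
    ii_count: int = 0  # number of I/Os seen in the range
    ii_bytes: int = 0  # aggregate bytes transferred in the range


@dataclass
class RangeStats:
    ffis_offset: int   # starting offset of the range
    ffis_length: int   # range length in bytes
    ffis_read: IoInfo
    ffis_write: IoInfo


def record_io(stats, offset, length, is_write):
    """Account one I/O against every 1 MB range it overlaps.

    stats maps a range's starting offset to its RangeStats entry.
    """
    end = offset + length
    pos = offset
    while pos < end:
        base = (pos // RANGE_SIZE) * RANGE_SIZE
        chunk_end = min(end, base + RANGE_SIZE)
        entry = stats.setdefault(
            base, RangeStats(base, RANGE_SIZE, IoInfo(), IoInfo())
        )
        info = entry.ffis_write if is_write else entry.ffis_read
        info.ii_count += 1
        info.ii_bytes += chunk_end - pos
        pos = chunk_end
```

For instance, a 2048-byte write starting 1024 bytes before a range boundary is split into 1024 bytes in each of the two adjacent ranges. Reporting thresholds such as 1 MB/second would be applied separately when deciding which entries to send.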
9.3. ff_layoutreturn4 | 9.3. ff_layoutreturn4 | |||
<CODE BEGINS> | <CODE BEGINS> | |||
/// struct ff_layoutreturn4 { | /// struct ff_layoutreturn4 { | |||
/// ff_ioerr4 fflr_ioerr_report<>; | /// ff_ioerr4 fflr_ioerr_report<>; | |||
/// ff_iostats4 fflr_iostats_report<>; | /// ff_iostats4 fflr_iostats_report<>; | |||
skipping to change at page 26, line 35 | skipping to change at page 28, line 35 | |||
In cases where clients are uncommunicative and their lease has | In cases where clients are uncommunicative and their lease has | |||
expired or when clients fail to return recalled layouts within a | expired or when clients fail to return recalled layouts within a | |||
lease period, at the least the server MAY revoke client layouts and/ | lease period, at the least the server MAY revoke client layouts and/ | |||
or device address mappings and reassign these resources to other | or device address mappings and reassign these resources to other | |||
clients (see "Recalling a Layout" in [RFC5661]). To avoid data | clients (see "Recalling a Layout" in [RFC5661]). To avoid data | |||
corruption, the metadata server MUST fence off the revoked clients | corruption, the metadata server MUST fence off the revoked clients | |||
from the respective data files as described in Section 2.2. | from the respective data files as described in Section 2.2. | |||
15. Security Considerations | 15. Security Considerations | |||
The pNFS extension partitions the NFSv4 file system protocol into two | The pNFS extension partitions the NFSv4.1+ file system protocol into | |||
parts, the control path and the data path (storage protocol). The | two parts, the control path and the data path (storage protocol). | |||
control path contains all the new operations described by this | The control path contains all the new operations described by this | |||
extension; all existing NFSv4 security mechanisms and features apply | extension; all existing NFSv4 security mechanisms and features apply | |||
to the control path. The combination of components in a pNFS system | to the control path. The combination of components in a pNFS system | |||
is required to preserve the security properties of NFSv4 with respect | is required to preserve the security properties of NFSv4.1+ with | |||
to an entity accessing data via a client, including security | respect to an entity accessing data via a client, including security | |||
countermeasures to defend against threats that NFSv4 provides | countermeasures to defend against threats that NFSv4.1+ provides | |||
defenses for in environments where these threats are considered | defenses for in environments where these threats are considered | |||
significant. | significant. | |||
The metadata server enforces the file access-control policy at | The metadata server enforces the file access-control policy at | |||
LAYOUTGET time. The client should use suitable authorization | LAYOUTGET time. The client should use suitable authorization | |||
credentials for getting the layout for the requested iomode (READ or | credentials for getting the layout for the requested iomode (READ or | |||
RW) and the server verifies the permissions and ACL for these | RW) and the server verifies the permissions and ACL for these | |||
credentials, possibly returning NFS4ERR_ACCESS if the client is not | credentials, possibly returning NFS4ERR_ACCESS if the client is not | |||
allowed the requested iomode. If the LAYOUTGET operation succeeds | allowed the requested iomode. If the LAYOUTGET operation succeeds | |||
the client receives, as part of the layout, a set of credentials | the client receives, as part of the layout, a set of credentials | |||
allowing it I/O access to the specified data files corresponding to | allowing it I/O access to the specified data files corresponding to | |||
the requested iomode. When the client acts on I/O operations on | the requested iomode. When the client acts on I/O operations on | |||
behalf of its local users, it MUST authenticate and authorize the | behalf of its local users, it MUST authenticate and authorize the | |||
user by issuing respective OPEN and ACCESS calls to the metadata | user by issuing respective OPEN and ACCESS calls to the metadata | |||
server, similar to having NFSv4 data delegations. If access is | server, similar to having NFSv4 data delegations. If access is | |||
allowed, the client uses the corresponding (READ or RW) credentials | allowed, the client uses the corresponding (READ or RW) credentials | |||
to perform the I/O operations at the data files storage devices. | to perform the I/O operations at the data file's storage devices. | |||
When the metadata server receives a request to change a file's | When the metadata server receives a request to change a file's | |||
permissions or ACL, it SHOULD recall all layouts for that file and it | permissions or ACL, it SHOULD recall all layouts for that file and it | |||
MUST fence off the clients holding outstanding layouts for the | MUST fence off the clients holding outstanding layouts for the | |||
respective file by implicitly invalidating the outstanding | respective file by implicitly invalidating the outstanding | |||
credentials on all data files comprising before committing to the new | credentials on all data files comprising before committing to the new | |||
permissions and ACL. Doing this will ensure that clients re- | permissions and ACL. Doing this will ensure that clients re- | |||
authorize their layouts according to the modified permissions and ACL | authorize their layouts according to the modified permissions and ACL | |||
by requesting new layouts. Recalling the layouts in this case is | by requesting new layouts. Recalling the layouts in this case is | |||
a courtesy of the server intended to prevent clients from getting an | a courtesy of the server intended to prevent clients from getting an | |||
error on I/Os done after the client was fenced off. | error on I/Os done after the client was fenced off. | |||
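The ordering of steps in the paragraph above can be illustrated with a small sketch. It is a toy model and not an implementation from the draft: change_permissions, the file_state dictionary, and its keys are hypothetical, and the only property being shown is that the optional recall and the mandatory fencing both happen before the new permissions are committed.

```python
def change_permissions(file_state, new_mode, recall=True):
    """Apply a permission change on a toy metadata server model.

    Returns the list of steps in the order they were performed.
    """
    steps = []
    if recall:
        # SHOULD: recall layouts as a courtesy so clients avoid I/O errors.
        file_state["layouts_recalled"] = True
        steps.append("recall_layouts")
    # MUST: fence by invalidating outstanding credentials on the data files.
    file_state["credentials_valid"] = False
    steps.append("fence_clients")
    # Only now commit to the new permissions and ACL.
    file_state["mode"] = new_mode
    steps.append("commit_permissions")
    return steps
```

Skipping the recall (recall=False) still fences before committing, which matches the SHOULD versus MUST distinction in the text.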
skipping to change at page 28, line 28 | skipping to change at page 30, line 28 | |||
[NFSv42] Haynes, T., "NFS Version 4 Minor Version 2", draft-ietf- | [NFSv42] Haynes, T., "NFS Version 4 Minor Version 2", draft-ietf- | |||
nfsv4-minorversion2-28 (Work In Progress), November 2014. | nfsv4-minorversion2-28 (Work In Progress), November 2014. | |||
[RFC1813] IETF, "NFS Version 3 Protocol Specification", RFC 1813, | [RFC1813] IETF, "NFS Version 3 Protocol Specification", RFC 1813, | |||
June 1995. | June 1995. | |||
[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate | [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate | |||
Requirement Levels", BCP 14, RFC 2119, March 1997. | Requirement Levels", BCP 14, RFC 2119, March 1997. | |||
[RFC3530] Shepler, S., Callaghan, B., Robinson, D., Thurlow, R., | ||||
Beame, C., Eisler, M., and D. Noveck, "Network File System | ||||
(NFS) version 4 Protocol", RFC 3530, April 2003. | ||||
[RFC4506] Eisler, M., "XDR: External Data Representation Standard", | [RFC4506] Eisler, M., "XDR: External Data Representation Standard", | |||
STD 67, RFC 4506, May 2006. | STD 67, RFC 4506, May 2006. | |||
[RFC5531] Thurlow, R., "RPC: Remote Procedure Call Protocol | ||||
Specification Version 2", RFC 5531, May 2009. | ||||
[RFC5661] Shepler, S., Ed., Eisler, M., Ed., and D. Noveck, Ed., | [RFC5661] Shepler, S., Ed., Eisler, M., Ed., and D. Noveck, Ed., | |||
"Network File System (NFS) Version 4 Minor Version 1 | "Network File System (NFS) Version 4 Minor Version 1 | |||
Protocol", RFC 5661, January 2010. | Protocol", RFC 5661, January 2010. | |||
[RFC5662] Shepler, S., Ed., Eisler, M., Ed., and D. Noveck, Ed., | [RFC5662] Shepler, S., Ed., Eisler, M., Ed., and D. Noveck, Ed., | |||
"Network File System (NFS) Version 4 Minor Version 1 | "Network File System (NFS) Version 4 Minor Version 1 | |||
External Data Representation Standard (XDR) Description", | External Data Representation Standard (XDR) Description", | |||
RFC 5662, January 2010. | RFC 5662, January 2010. | |||
[RFC5664] Halevy, B., Ed., Welch, B., Ed., and J. Zelenka, Ed., | [RFCNFSv4] | |||
"Object-Based Parallel NFS (pNFS) Operations", RFC 5664, | Haynes, T. and D. Noveck, "NFS Version 4 Protocol", draft- | |||
January 2010. | ietf-nfsv4-rfc3530bis-35 (work in progress), Dec 2014. | |||
[pNFSLayouts] | [pNFSLayouts] | |||
Haynes, T., "Considerations for a New pNFS Layout Type", | Haynes, T., "Considerations for a New pNFS Layout Type", | |||
draft-ietf-nfsv4-layout-types-02 (Work In Progress), | draft-ietf-nfsv4-layout-types-02 (Work In Progress), | |||
October 2014. | October 2014. | |||
17.2. Informative References | 17.2. Informative References | |||
[ANSI400-2004] | ||||
Weber, R., Ed., "ANSI INCITS 400-2004, Information | ||||
Technology - SCSI Object-Based Storage Device Commands | ||||
(OSD)", December 2004. | ||||
[rpcsec_gssv3] | [rpcsec_gssv3] | |||
Adamson, W. and N. Williams, "Remote Procedure Call (RPC) | Adamson, W. and N. Williams, "Remote Procedure Call (RPC) | |||
Security Version 3", November 2014. | Security Version 3", November 2014. | |||
Appendix A. Acknowledgments | Appendix A. Acknowledgments | |||
Those who provided miscellaneous comments to early drafts of this | Those who provided miscellaneous comments to early drafts of this | |||
document include: Matt W. Benjamin, Adam Emerson, Tom Haynes, J. | document include: Matt W. Benjamin, Adam Emerson, J. Bruce Fields, | |||
Bruce Fields, and Lev Solomonov. | and Lev Solomonov. | |||
Those who provided miscellaneous comments to the final drafts of this | ||||
document include: Anand Ganesh, Robert Wipfel, Gobikrishnan | ||||
Sundharraj, and Trond Myklebust. | ||||
Idan Kedar caught a nasty bug in the interaction of client side | Idan Kedar caught a nasty bug in the interaction of client side | |||
mirroring and the minor versioning of devices. | mirroring and the minor versioning of devices. | |||
Dave Noveck provided a comprehensive review of the document during | ||||
the working group last call. | ||||
Olga Kornievskaia led the charge against the use of a credential | ||||
versus a principal in the fencing approach. Andy Adamson and | ||||
Benjamin Kaduk helped to sharpen the focus. | ||||
Appendix B. RFC Editor Notes | Appendix B. RFC Editor Notes | |||
[RFC Editor: please remove this section prior to publishing this | [RFC Editor: please remove this section prior to publishing this | |||
document as an RFC] | document as an RFC] | |||
[RFC Editor: prior to publishing this document as an RFC, please | [RFC Editor: prior to publishing this document as an RFC, please | |||
replace all occurrences of RFCTBD10 with RFCxxxx where xxxx is the | replace all occurrences of RFCTBD10 with RFCxxxx where xxxx is the | |||
RFC number of this document] | RFC number of this document] | |||
Authors' Addresses | Authors' Addresses | |||
Benny Halevy | Benny Halevy | |||
Primary Data, Inc. | ||||
Email: bhalevy@primarydata.com | ||||
URI: http://www.primarydata.com | ||||
Email: bhalevy@gmail.com | ||||
Thomas Haynes | Thomas Haynes | |||
Primary Data, Inc. | Primary Data, Inc. | |||
4300 El Camino Real Ste 100 | 4300 El Camino Real Ste 100 | |||
Los Altos, CA 94022 | Los Altos, CA 94022 | |||
USA | USA | |||
Phone: +1 408 215 1519 | Phone: +1 408 215 1519 | |||
Email: thomas.haynes@primarydata.com | Email: thomas.haynes@primarydata.com | |||
End of changes. 60 change blocks. | ||||
201 lines changed or deleted | 292 lines changed or added | |||