draft-ietf-nfsv4-flex-files-12.txt | draft-ietf-nfsv4-flex-files-13.txt | |||
---|---|---|---|---|
NFSv4 B. Halevy | NFSv4 B. Halevy | |||
Internet-Draft | Internet-Draft | |||
Intended status: Standards Track T. Haynes | Intended status: Standards Track T. Haynes | |||
Expires: January 21, 2018 Primary Data | Expires: February 8, 2018 Primary Data | |||
July 20, 2017 | August 07, 2017 | |||
Parallel NFS (pNFS) Flexible File Layout | Parallel NFS (pNFS) Flexible File Layout | |||
draft-ietf-nfsv4-flex-files-12.txt | draft-ietf-nfsv4-flex-files-13.txt | |||
Abstract | Abstract | |||
The Parallel Network File System (pNFS) allows a separation between | The Parallel Network File System (pNFS) allows a separation between | |||
the metadata (onto a metadata server) and data (onto a storage | the metadata (onto a metadata server) and data (onto a storage | |||
device) for a file. The flexible file layout type is defined in this | device) for a file. The flexible file layout type is defined in this | |||
document as an extension to pNFS which allows the use of storage | document as an extension to pNFS which allows the use of storage | |||
devices in a fashion such that they require only a quite limited | devices in a fashion such that they require only a quite limited | |||
degree of interaction with the metadata server, using already | degree of interaction with the metadata server, using already | |||
existing protocols. Client side mirroring is also added to provide | existing protocols. Client side mirroring is also added to provide | |||
skipping to change at page 1, line 38 | skipping to change at page 1, line 38 | |||
Internet-Drafts are working documents of the Internet Engineering | Internet-Drafts are working documents of the Internet Engineering | |||
Task Force (IETF). Note that other groups may also distribute | Task Force (IETF). Note that other groups may also distribute | |||
working documents as Internet-Drafts. The list of current Internet- | working documents as Internet-Drafts. The list of current Internet- | |||
Drafts is at http://datatracker.ietf.org/drafts/current/. | Drafts is at http://datatracker.ietf.org/drafts/current/. | |||
Internet-Drafts are draft documents valid for a maximum of six months | Internet-Drafts are draft documents valid for a maximum of six months | |||
and may be updated, replaced, or obsoleted by other documents at any | and may be updated, replaced, or obsoleted by other documents at any | |||
time. It is inappropriate to use Internet-Drafts as reference | time. It is inappropriate to use Internet-Drafts as reference | |||
material or to cite them other than as "work in progress." | material or to cite them other than as "work in progress." | |||
This Internet-Draft will expire on January 21, 2018. | This Internet-Draft will expire on February 8, 2018. | |||
Copyright Notice | Copyright Notice | |||
Copyright (c) 2017 IETF Trust and the persons identified as the | Copyright (c) 2017 IETF Trust and the persons identified as the | |||
document authors. All rights reserved. | document authors. All rights reserved. | |||
This document is subject to BCP 78 and the IETF Trust's Legal | This document is subject to BCP 78 and the IETF Trust's Legal | |||
Provisions Relating to IETF Documents | Provisions Relating to IETF Documents | |||
(http://trustee.ietf.org/license-info) in effect on the date of | (http://trustee.ietf.org/license-info) in effect on the date of | |||
publication of this document. Please review these documents | publication of this document. Please review these documents | |||
carefully, as they describe your rights and restrictions with respect | carefully, as they describe your rights and restrictions with respect | |||
to this document. Code Components extracted from this document must | to this document. Code Components extracted from this document must | |||
include Simplified BSD License text as described in Section 4.e of | include Simplified BSD License text as described in Section 4.e of | |||
the Trust Legal Provisions and are provided without warranty as | the Trust Legal Provisions and are provided without warranty as | |||
described in the Simplified BSD License. | described in the Simplified BSD License. | |||
Table of Contents | Table of Contents | |||
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 3 | 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 3 | |||
1.1. Definitions . . . . . . . . . . . . . . . . . . . . . . . 3 | 1.1. Definitions . . . . . . . . . . . . . . . . . . . . . . . 3 | |||
1.2. Difference Between a Data Server and a Storage Device . . 5 | 1.2. Requirements Language . . . . . . . . . . . . . . . . . . 5 | |||
1.3. Requirements Language . . . . . . . . . . . . . . . . . . 6 | ||||
2. Coupling of Storage Devices . . . . . . . . . . . . . . . . . 6 | 2. Coupling of Storage Devices . . . . . . . . . . . . . . . . . 6 | |||
2.1. LAYOUTCOMMIT . . . . . . . . . . . . . . . . . . . . . . 6 | 2.1. LAYOUTCOMMIT . . . . . . . . . . . . . . . . . . . . . . 6 | |||
2.2. Fencing Clients from the Storage Device . . . . . . . . . 6 | 2.2. Fencing Clients from the Storage Device . . . . . . . . . 6 | |||
2.2.1. Implementation Notes for Synthetic uids/gids . . . . 7 | 2.2.1. Implementation Notes for Synthetic uids/gids . . . . 7 | |||
2.2.2. Example of using Synthetic uids/gids . . . . . . . . 8 | 2.2.2. Example of using Synthetic uids/gids . . . . . . . . 8 | |||
2.3. State and Locking Models . . . . . . . . . . . . . . . . 9 | 2.3. State and Locking Models . . . . . . . . . . . . . . . . 9 | |||
2.3.1. Loosely Coupled Locking Model . . . . . . . . . . . . 9 | 2.3.1. Loosely Coupled Locking Model . . . . . . . . . . . . 9 | |||
2.3.2. Tighly Coupled Locking Model . . . . . . . . . . . . 11 | 2.3.2. Tightly Coupled Locking Model . . . . . . . . . . . . 10 | |||
3. XDR Description of the Flexible File Layout Type . . . . . . 12 | 3. XDR Description of the Flexible File Layout Type . . . . . . 12 | |||
3.1. Code Components Licensing Notice . . . . . . . . . . . . 13 | 3.1. Code Components Licensing Notice . . . . . . . . . . . . 13 | |||
4. Device Addressing and Discovery . . . . . . . . . . . . . . . 14 | 4. Device Addressing and Discovery . . . . . . . . . . . . . . . 14 | |||
4.1. ff_device_addr4 . . . . . . . . . . . . . . . . . . . . . 15 | 4.1. ff_device_addr4 . . . . . . . . . . . . . . . . . . . . . 14 | |||
4.2. Storage Device Multipathing . . . . . . . . . . . . . . . 16 | 4.2. Storage Device Multipathing . . . . . . . . . . . . . . . 16 | |||
5. Flexible File Layout Type . . . . . . . . . . . . . . . . . . 17 | 5. Flexible File Layout Type . . . . . . . . . . . . . . . . . . 17 | |||
5.1. ff_layout4 . . . . . . . . . . . . . . . . . . . . . . . 18 | 5.1. ff_layout4 . . . . . . . . . . . . . . . . . . . . . . . 17 | |||
5.1.1. Error Codes from LAYOUTGET . . . . . . . . . . . . . 21 | 5.1.1. Error Codes from LAYOUTGET . . . . . . . . . . . . . 21 | |||
5.1.2. Client Interactions with FF_FLAGS_NO_IO_THRU_MDS . . 22 | 5.1.2. Client Interactions with FF_FLAGS_NO_IO_THRU_MDS . . 21 | |||
5.2. Interactions Between Devices and Layouts . . . . . . . . 22 | 5.2. Interactions Between Devices and Layouts . . . . . . . . 22 | |||
5.3. Handling Version Errors . . . . . . . . . . . . . . . . . 22 | 5.3. Handling Version Errors . . . . . . . . . . . . . . . . . 22 | |||
6. Striping via Sparse Mapping . . . . . . . . . . . . . . . . . 23 | 6. Striping via Sparse Mapping . . . . . . . . . . . . . . . . . 23 | |||
7. Recovering from Client I/O Errors . . . . . . . . . . . . . . 23 | 7. Recovering from Client I/O Errors . . . . . . . . . . . . . . 23 | |||
8. Mirroring . . . . . . . . . . . . . . . . . . . . . . . . . . 24 | 8. Mirroring . . . . . . . . . . . . . . . . . . . . . . . . . . 24 | |||
8.1. Selecting a Mirror . . . . . . . . . . . . . . . . . . . 25 | 8.1. Selecting a Mirror . . . . . . . . . . . . . . . . . . . 24 | |||
8.2. Writing to Mirrors . . . . . . . . . . . . . . . . . . . 25 | 8.2. Writing to Mirrors . . . . . . . . . . . . . . . . . . . 25 | |||
8.2.1. Single Storage Device Updates Mirrors . . . . . . . . 25 | 8.2.1. Single Storage Device Updates Mirrors . . . . . . . . 25 | |||
8.2.2. Single Storage Device Updates Mirrors . . . . . . . . 26 | 8.2.2. Single Storage Device Updates Mirrors . . . . . . . . 25 | |||
8.2.3. Handling Write Errors . . . . . . . . . . . . . . . . 26 | 8.2.3. Handling Write Errors . . . . . . . . . . . . . . . . 25 | |||
8.2.4. Handling Write COMMITs . . . . . . . . . . . . . . . 27 | 8.2.4. Handling Write COMMITs . . . . . . . . . . . . . . . 26 | |||
8.3. Metadata Server Resilvering of the File . . . . . . . . . 27 | 8.3. Metadata Server Resilvering of the File . . . . . . . . . 27 | |||
9. Flexible Files Layout Type Return . . . . . . . . . . . . . . 27 | 9. Flexible Files Layout Type Return . . . . . . . . . . . . . . 27 | |||
9.1. I/O Error Reporting . . . . . . . . . . . . . . . . . . . 29 | 9.1. I/O Error Reporting . . . . . . . . . . . . . . . . . . . 28 | |||
9.1.1. ff_ioerr4 . . . . . . . . . . . . . . . . . . . . . . 29 | 9.1.1. ff_ioerr4 . . . . . . . . . . . . . . . . . . . . . . 28 | |||
9.2. Layout Usage Statistics . . . . . . . . . . . . . . . . . 29 | 9.2. Layout Usage Statistics . . . . . . . . . . . . . . . . . 29 | |||
9.2.1. ff_io_latency4 . . . . . . . . . . . . . . . . . . . 30 | 9.2.1. ff_io_latency4 . . . . . . . . . . . . . . . . . . . 29 | |||
9.2.2. ff_layoutupdate4 . . . . . . . . . . . . . . . . . . 30 | 9.2.2. ff_layoutupdate4 . . . . . . . . . . . . . . . . . . 30 | |||
9.2.3. ff_iostats4 . . . . . . . . . . . . . . . . . . . . . 31 | 9.2.3. ff_iostats4 . . . . . . . . . . . . . . . . . . . . . 31 | |||
9.3. ff_layoutreturn4 . . . . . . . . . . . . . . . . . . . . 32 | 9.3. ff_layoutreturn4 . . . . . . . . . . . . . . . . . . . . 32 | |||
10. Flexible Files Layout Type LAYOUTERROR . . . . . . . . . . . 32 | ||||
10. Flexible Files Layout Type LAYOUTERROR . . . . . . . . . . . 33 | ||||
11. Flexible Files Layout Type LAYOUTSTATS . . . . . . . . . . . 33 | 11. Flexible Files Layout Type LAYOUTSTATS . . . . . . . . . . . 33 | |||
12. Flexible File Layout Type Creation Hint . . . . . . . . . . . 33 | 12. Flexible File Layout Type Creation Hint . . . . . . . . . . . 33 | |||
12.1. ff_layouthint4 . . . . . . . . . . . . . . . . . . . . . 34 | 12.1. ff_layouthint4 . . . . . . . . . . . . . . . . . . . . . 33 | |||
13. Recalling a Layout . . . . . . . . . . . . . . . . . . . . . 34 | 13. Recalling a Layout . . . . . . . . . . . . . . . . . . . . . 34 | |||
13.1. CB_RECALL_ANY . . . . . . . . . . . . . . . . . . . . . 34 | 13.1. CB_RECALL_ANY . . . . . . . . . . . . . . . . . . . . . 34 | |||
14. Client Fencing . . . . . . . . . . . . . . . . . . . . . . . 35 | 14. Client Fencing . . . . . . . . . . . . . . . . . . . . . . . 35 | |||
15. Security Considerations . . . . . . . . . . . . . . . . . . . 36 | 15. Security Considerations . . . . . . . . . . . . . . . . . . . 35 | |||
15.1. Kerberized File Access . . . . . . . . . . . . . . . . . 37 | 15.1. RPCSEC_GSS and Security Services . . . . . . . . . . . . 36 | |||
15.1.1. Loosely Coupled . . . . . . . . . . . . . . . . . . 37 | 15.1.1. Loosely Coupled . . . . . . . . . . . . . . . . . . 36 | |||
15.1.2. Tightly Coupled . . . . . . . . . . . . . . . . . . 37 | 15.1.2. Tightly Coupled . . . . . . . . . . . . . . . . . . 36 | |||
16. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 37 | 16. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 37 | |||
17. References . . . . . . . . . . . . . . . . . . . . . . . . . 38 | 17. References . . . . . . . . . . . . . . . . . . . . . . . . . 38 | |||
17.1. Normative References . . . . . . . . . . . . . . . . . . 38 | 17.1. Normative References . . . . . . . . . . . . . . . . . . 38 | |||
17.2. Informative References . . . . . . . . . . . . . . . . . 39 | 17.2. Informative References . . . . . . . . . . . . . . . . . 39 | |||
Appendix A. Acknowledgments . . . . . . . . . . . . . . . . . . 39 | Appendix A. Acknowledgments . . . . . . . . . . . . . . . . . . 39 | |||
Appendix B. RFC Editor Notes . . . . . . . . . . . . . . . . . . 40 | Appendix B. RFC Editor Notes . . . . . . . . . . . . . . . . . . 39 | |||
Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 40 | Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 40 | |||
1. Introduction | 1. Introduction | |||
In the parallel Network File System (pNFS), the metadata server | In the parallel Network File System (pNFS), the metadata server | |||
returns layout type structures that describe where file data is | returns layout type structures that describe where file data is | |||
located. There are different layout types for different storage | located. There are different layout types for different storage | |||
systems and methods of arranging data on storage devices. This | systems and methods of arranging data on storage devices. This | |||
document defines the flexible file layout type used with file-based | document defines the flexible file layout type used with file-based | |||
data servers that are accessed using the Network File System (NFS) | data servers that are accessed using the Network File System (NFS) | |||
skipping to change at page 3, line 44 | skipping to change at page 3, line 42 | |||
To provide a global state model equivalent to that of the files | To provide a global state model equivalent to that of the files | |||
layout type, a back-end control protocol MAY be implemented between | layout type, a back-end control protocol MAY be implemented between | |||
the metadata server and NFSv4.1+ storage devices. It is out of scope | the metadata server and NFSv4.1+ storage devices. It is out of scope | |||
for this document to specify such a protocol, yet the requirements | for this document to specify such a protocol, yet the requirements | |||
for the protocol are specified in [RFC5661] and clarified in | for the protocol are specified in [RFC5661] and clarified in | |||
[pNFSLayouts]. | [pNFSLayouts]. | |||
1.1. Definitions | 1.1. Definitions | |||
control protocol: is a set of requirements for the communication of | control communication requirements: defines for a layout type the | |||
information on layouts, stateids, file metadata, and file data | details regarding information on layouts, stateids, file metadata, | |||
between the metadata server and the storage devices (see | and file data which must be communicated between the metadata | |||
[pNFSLayouts]). | server and the storage devices. | |||
control protocol: defines a particular mechanism that an | ||||
implementation of a layout type would use to meet the control | ||||
communication requirement for that layout type. This need not be | ||||
a protocol as normally understood. In some cases the same | ||||
protocol may be used as a control protocol and data access | ||||
protocol. | ||||
client-side mirroring: is when the client and not the server is | client-side mirroring: is when the client and not the server is | |||
responsible for updating all of the mirrored copies of a layout | responsible for updating all of the mirrored copies of a layout | |||
segment. | segment. | |||
data file: is that part of the file system object which contains the | data file: is that part of the file system object which contains the | |||
content. | content. | |||
data server (DS): is one of the pNFS servers which provides the | data server (DS): is another term for storage device. | |||
contents of a file system object which is a regular file. | ||||
Depending on the layout, there might be one or more data servers | ||||
over which the data is striped. Note that while the metadata | ||||
server is strictly accessed over the NFSv4.1+ protocol, depending | ||||
on the layout type, the data server could be accessed via any | ||||
protocol that meets the pNFS requirements. | ||||
fencing: is when the metadata server prevents the storage devices | fencing: is when the metadata server prevents the storage devices | |||
from processing I/O from a specific client to a specific file. | from processing I/O from a specific client to a specific file. | |||
file layout type: is a layout type in which the storage devices are | file layout type: is a layout type in which the storage devices are | |||
accessed via the NFS protocol (see Section 13 of [RFC5661]). | accessed via the NFS protocol (see Section 13 of [RFC5661]). | |||
layout: informs a client of which storage devices it needs to | layout: informs a client of which storage devices it needs to | |||
communicate with (and over which protocol) to perform I/O on a | communicate with (and over which protocol) to perform I/O on a | |||
file. The layout might also provide some hints about how the | file. The layout might also provide some hints about how the | |||
skipping to change at page 5, line 34 | skipping to change at page 5, line 34 | |||
this can also be done to create a new mirrored copy of the layout | this can also be done to create a new mirrored copy of the layout | |||
segment. | segment. | |||
rsize: is the data transfer buffer size used for reads. | rsize: is the data transfer buffer size used for reads. | |||
stateid: is a 128-bit quantity returned by a server that uniquely | stateid: is a 128-bit quantity returned by a server that uniquely | |||
defines the open and locking states provided by the server for a | defines the open and locking states provided by the server for a | |||
specific open-owner or lock-owner/open-owner pair for a specific | specific open-owner or lock-owner/open-owner pair for a specific | |||
file and type of lock. | file and type of lock. | |||
storage device: is another term used almost interchangeably with | storage device: designates the target to which clients may direct I/ | |||
data server. See Section 1.2 for the nuances between the two. | O requests when they hold an appropriate layout. See Section 2.1 | |||
of [pNFSLayouts] for further discussion of the difference between | ||||
a data store and a storage device. | ||||
tight coupling: is when the metadata server and the storage devices | tight coupling: is when the metadata server and the storage devices | |||
do have a control protocol present. | do have a control protocol present. | |||
wsize: is the data transfer buffer size used for writes. | wsize: is the data transfer buffer size used for writes. | |||
1.2. Difference Between a Data Server and a Storage Device | 1.2. Requirements Language | |||
We defined a data server as a pNFS server, which implies that it can | ||||
utilize the NFSv4.1+ protocol to communicate with the client. As | ||||
such, only the file layout type would currently meet this | ||||
requirement. The more generic concept is a storage device, which can | ||||
use any protocol to communicate with the client. The requirements | ||||
for a storage device to act together with the metadata server to | ||||
provide data to a client are that there is a layout type | ||||
specification for the given protocol and that the metadata server has | ||||
granted a layout to the client. Note that nothing precludes there | ||||
being multiple supported layout types (i.e., protocols) between a | ||||
metadata server, storage devices, and client. | ||||
As storage device is the more encompassing terminology, this document | ||||
utilizes it over data server. | ||||
1.3. Requirements Language | ||||
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", | The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", | |||
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this | "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this | |||
document are to be interpreted as described in [RFC2119]. | document are to be interpreted as described in [RFC2119]. | |||
2. Coupling of Storage Devices | 2. Coupling of Storage Devices | |||
The coupling of the metadata server with the storage devices can be | The coupling of the metadata server with the storage devices can be | |||
either tight or loose. In a tight coupling, there is a control | either tight or loose. In a tight coupling, there is a control | |||
protocol present to manage security, LAYOUTCOMMITs, etc. With a | protocol present to manage security, LAYOUTCOMMITs, etc. With a | |||
skipping to change at page 7, line 7 | skipping to change at page 6, line 44 | |||
With loosely coupled storage devices, the metadata server uses | With loosely coupled storage devices, the metadata server uses | |||
synthetic uids and gids for the data file, where the uid owner of the | synthetic uids and gids for the data file, where the uid owner of the | |||
data file is allowed read/write access and the gid owner is allowed | data file is allowed read/write access and the gid owner is allowed | |||
read only access. As part of the layout (see ffds_user and | read only access. As part of the layout (see ffds_user and | |||
ffds_group in Section 5.1), the client is provided with the user and | ffds_group in Section 5.1), the client is provided with the user and | |||
group to be used in the Remote Procedure Call (RPC) [RFC5531] | group to be used in the Remote Procedure Call (RPC) [RFC5531] | |||
credentials needed to access the data file. Fencing off of clients | credentials needed to access the data file. Fencing off of clients | |||
is achieved by the metadata server changing the synthetic uid and/or | is achieved by the metadata server changing the synthetic uid and/or | |||
gid owners of the data file on the storage device to implicitly | gid owners of the data file on the storage device to implicitly | |||
revoke the outstanding RPC credentials. A client presenting the | revoke the outstanding RPC credentials. A client presenting the | |||
wrong credential for the deisred access will get a NFS4ERR_ACCESS | wrong credential for the desired access will get a NFS4ERR_ACCESS | |||
error. | error. | |||
With this loosely coupled model, the metadata server is not able to | With this loosely coupled model, the metadata server is not able to | |||
fence off a single client, it is forced to fence off all clients. | fence off a single client, it is forced to fence off all clients. | |||
However, as the other clients react to the fencing, returning their | However, as the other clients react to the fencing, returning their | |||
layouts and trying to get new ones, the metadata server can hand out | layouts and trying to get new ones, the metadata server can hand out | |||
a new uid and gid to allow access. | a new uid and gid to allow access. | |||
Note: it is recommended to implement common access control methods at | Note: it is recommended to implement common access control methods at | |||
the storage device filesystem to allow only the metadata server root | the storage device filesystem to allow only the metadata server root | |||
skipping to change at page 8, line 4 | skipping to change at page 7, line 41 | |||
hand out no layout (forcing the I/O through it), or deny the client | hand out no layout (forcing the I/O through it), or deny the client | |||
further access to the file. | further access to the file. | |||
2.2.1. Implementation Notes for Synthetic uids/gids | 2.2.1. Implementation Notes for Synthetic uids/gids | |||
The selection method for the synthetic uids and gids to be used for | The selection method for the synthetic uids and gids to be used for | |||
fencing in loosely coupled storage devices is strictly an | fencing in loosely coupled storage devices is strictly an | |||
implementation issue. I.e., an administrator might restrict a range | implementation issue. I.e., an administrator might restrict a range | |||
of such ids available to the Lightweight Directory Access Protocol | of such ids available to the Lightweight Directory Access Protocol | |||
(LDAP) 'uid' field [RFC4519]. She might also be able to choose an id | (LDAP) 'uid' field [RFC4519]. She might also be able to choose an id | |||
that would never be used to grant acccess. Then when the metadata | that would never be used to grant access. Then when the metadata | |||
server had a request to access a file, a SETATTR would be sent to the | server had a request to access a file, a SETATTR would be sent to the | |||
storage device to set the owner and group of the data file. The user | storage device to set the owner and group of the data file. The user | |||
and group might be selected in a round robin fashion from the range | and group might be selected in a round robin fashion from the range | |||
of available ids. | of available ids. | |||
Those ids would be sent back as ffds_user and ffds_group to the | Those ids would be sent back as ffds_user and ffds_group to the | |||
client. And it would present them as the RPC credentials to the | client. And it would present them as the RPC credentials to the | |||
storage device. When the client was done accessing the file and the | storage device. When the client was done accessing the file and the | |||
metadata server knew that no other client was accessing the file, it | metadata server knew that no other client was accessing the file, it | |||
could reset the owner and group to restrict access to the data file. | could reset the owner and group to restrict access to the data file. | |||
skipping to change at page 10, line 16 | skipping to change at page 9, line 51 | |||
follows: | follows: | |||
o OPENs are dealt with by the metadata server. Stateids are | o OPENs are dealt with by the metadata server. Stateids are | |||
selected by the metadata server and associated with the client id | selected by the metadata server and associated with the client id | |||
describing the client's connection to the metadata server. The | describing the client's connection to the metadata server. The | |||
metadata server may need to interact with the storage device to | metadata server may need to interact with the storage device to | |||
locate the file to be opened, but no locking-related functionality | locate the file to be opened, but no locking-related functionality | |||
need be used on the storage device. | need be used on the storage device. | |||
OPEN_DOWNGRADE and CLOSE only require local execution on the | OPEN_DOWNGRADE and CLOSE only require local execution on the | |||
metadata sever. | metadata server. | |||
o Advisory byte-range locks can be implemented locally on the | o Advisory byte-range locks can be implemented locally on the | |||
metadata server. As in the case of OPENs, the stateids associated | metadata server. As in the case of OPENs, the stateids associated | |||
with byte-range locks are assigned by the metadata server and only | with byte-range locks are assigned by the metadata server and only | |||
used on the metadata server. | used on the metadata server. | |||
o Delegations are assigned by the metadata server which initiates | o Delegations are assigned by the metadata server which initiates | |||
recalls when conflicting OPENs are processed. No storage device | recalls when conflicting OPENs are processed. No storage device | |||
involvement is required. | involvement is required. | |||
skipping to change at page 11, line 8 | skipping to change at page 10, line 44 | |||
been revoked. | been revoked. | |||
As the client never receives a stateid generated by a storage device, | As the client never receives a stateid generated by a storage device, | |||
there is no client lease on the storage device and no prospect of | there is no client lease on the storage device and no prospect of | |||
lease expiration, even when access is via NFSv4 protocols. Clients | lease expiration, even when access is via NFSv4 protocols. Clients | |||
will have leases on the metadata server. In dealing with lease | will have leases on the metadata server. In dealing with lease | |||
expiration, the metadata server may need to use fencing to prevent | expiration, the metadata server may need to use fencing to prevent | |||
revoked stateids from being relied upon by a client unaware of the | revoked stateids from being relied upon by a client unaware of the | |||
fact that they have been revoked. | fact that they have been revoked. | |||
2.3.2. Tighly Coupled Locking Model | 2.3.2. Tightly Coupled Locking Model | |||
When locking-related operations are requested, they are primarily | When locking-related operations are requested, they are primarily | |||
dealt with by the metadata server, which generates the appropriate | dealt with by the metadata server, which generates the appropriate | |||
stateids. These stateids must be made known to the storage device | stateids. These stateids must be made known to the storage device | |||
using control protocol facilities, the details of which are not | using control protocol facilities, the details of which are not | |||
discussed in this document. | discussed in this document. | |||
Given this basic structure, locking-related operations are handled as | Given this basic structure, locking-related operations are handled as | |||
follows: | follows: | |||
o OPENs are dealt with primarily on the metadata server. Stateids | o OPENs are dealt with primarily on the metadata server. Stateids | |||
are selected by the metadata server and associated with the client | are selected by the metadata server and associated with the client | |||
id describing the client's connection to the metadata server. The | id describing the client's connection to the metadata server. The | |||
metadata server needs to interact with the storage device to | metadata server needs to interact with the storage device to | |||
locate the file to be opened, and to make the storage device aware | locate the file to be opened, and to make the storage device aware | |||
of the association between the metadata-sever-chosen stateid and | of the association between the metadata-server-chosen stateid and | |||
the client and openowner that it represents. | the client and openowner that it represents. | |||
OPEN_DOWNGRADE and CLOSE are executed initially on the metadata | OPEN_DOWNGRADE and CLOSE are executed initially on the metadata | |||
server but the state change made must be propagated to the storage | server but the state change made must be propagated to the storage | |||
device. | device. | |||
o Advisory byte-range locks can be implemented locally on the | o Advisory byte-range locks can be implemented locally on the | |||
metadata server. As in the case of OPENs, the stateids associated | metadata server. As in the case of OPENs, the stateids associated | |||
with byte-range locks, are assigned by the metadata server and are | with byte-range locks, are assigned by the metadata server and are | |||
available for use on the metadata server. Because I/O operations | available for use on the metadata server. Because I/O operations | |||
are allowed to present lock stateids, the metadata server needs | are allowed to present lock stateids, the metadata server needs | |||
the ability to make the storage device aware of the association | the ability to make the storage device aware of the association | |||
between the metadata-sever-chosen stateid and the corresponding | between the metadata-server-chosen stateid and the corresponding | |||
open stateid it is associated with. | open stateid it is associated with. | |||
o Mandatory byte-range locks can be supported when both the metadata | o Mandatory byte-range locks can be supported when both the metadata | |||
server and the storage devices have the appropriate support. As | server and the storage devices have the appropriate support. As | |||
in the case of advisory byte-range locks, these are assigned by | in the case of advisory byte-range locks, these are assigned by | |||
the metadata server and are available for use on the metadata | the metadata server and are available for use on the metadata | |||
server. To enable mandatory lock enforcement on the storage | server. To enable mandatory lock enforcement on the storage | |||
device, the metadata server needs the ability to make the storage | device, the metadata server needs the ability to make the storage | |||
device aware of the association between the metadata-sever-chosen | device aware of the association between the metadata-server-chosen | |||
stateid and the client, openowner, and lock (i.e., lockowner, | stateid and the client, openowner, and lock (i.e., lockowner, | |||
byte-range, lock-type) that it represents. Because I/O operations | byte-range, lock-type) that it represents. Because I/O operations | |||
are allowed to present lock stateids, this information needs to be | are allowed to present lock stateids, this information needs to be | |||
propagated to all storage devices to which I/O might be directed | propagated to all storage devices to which I/O might be directed | |||
rather than only to daya storage device that contain the locked | rather than only to storage device that contain the locked region. | |||
region. | ||||
o Delegations are assigned by the metadata server which initiates | o Delegations are assigned by the metadata server which initiates | |||
recalls when conflicting OPENs are processed. Because I/O | recalls when conflicting OPENs are processed. Because I/O | |||
operations are allowed to present delegation stateids, the | operations are allowed to present delegation stateids, the | |||
metadata server requires the ability to make the storage device | metadata server requires the ability to make the storage device | |||
aware of the association between the metadata-server-chosen | aware of the association between the metadata-server-chosen | |||
stateid and the filehandle and delegation type it represents, and | stateid and the filehandle and delegation type it represents, and | |||
to break such an association. | to break such an association. | |||
o TEST_STATEID is processed locally on the metadata server, without | o TEST_STATEID is processed locally on the metadata server, without | |||
skipping to change at page 16, line 16 ¶ | skipping to change at page 15, line 46 ¶ | |||
The ffdv_rsize and ffdv_wsize are used to communicate the maximum | The ffdv_rsize and ffdv_wsize are used to communicate the maximum | |||
rsize and wsize supported by the storage device. As the storage | rsize and wsize supported by the storage device. As the storage | |||
device can have a different rsize or wsize than the metadata server, | device can have a different rsize or wsize than the metadata server, | |||
the ffdv_rsize and ffdv_wsize allow the metadata server to | the ffdv_rsize and ffdv_wsize allow the metadata server to | |||
communicate that information on behalf of the storage device. | communicate that information on behalf of the storage device. | |||
ffdv_tightly_coupled informs the client as to whether the metadata | ffdv_tightly_coupled informs the client as to whether the metadata | |||
server is tightly coupled with the storage devices or not. Note that | server is tightly coupled with the storage devices or not. Note that | |||
even if the data protocol is at least NFSv4.1, it may still be the | even if the data protocol is at least NFSv4.1, it may still be the | |||
case that there is loose coupling is in effect. If | case that there is loose coupling in effect. If ffdv_tightly_coupled | |||
ffdv_tightly_coupled is not set, then the client MUST commit writes | is not set, then the client MUST commit writes to the storage devices | |||
to the storage devices for the file before sending a LAYOUTCOMMIT to | for the file before sending a LAYOUTCOMMIT to the metadata server. | |||
the metadata server. I.e., the writes MUST be committed by the | I.e., the writes MUST be committed by the client to stable storage | |||
client to stable storage via issuing WRITEs with stable_how == | via issuing WRITEs with stable_how == FILE_SYNC or by issuing a | |||
FILE_SYNC or by issuing a COMMIT after WRITEs with stable_how != | COMMIT after WRITEs with stable_how != FILE_SYNC (see Section 3.3.7 | |||
FILE_SYNC (see Section 3.3.7 of [RFC1813]). | of [RFC1813]). | |||
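The commit rule above for the case where ffdv_tightly_coupled is not set can be sketched as follows. This is an illustration only and appears in neither draft. The function name and the list-of-writes representation are hypothetical; the stable_how values are those of [RFC1813].

```python
from enum import IntEnum


class StableHow(IntEnum):
    """stable_how values as defined in RFC 1813."""
    UNSTABLE = 0
    DATA_SYNC = 1
    FILE_SYNC = 2


def needs_commit_before_layoutcommit(uncommitted_writes):
    """Loosely coupled case only (ffdv_tightly_coupled not set).

    `uncommitted_writes` is a hypothetical list holding the stable_how
    of every WRITE sent to the storage device that has not yet been
    covered by a COMMIT. A COMMIT is needed if any of them was sent
    with stable_how != FILE_SYNC.
    """
    return any(s != StableHow.FILE_SYNC for s in uncommitted_writes)
```

A client would call this before sending LAYOUTCOMMIT and, if it returns True, issue a COMMIT to the storage device first.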
4.2. Storage Device Multipathing | 4.2. Storage Device Multipathing | |||
The flexible file layout type supports multipathing to multiple | The flexible file layout type supports multipathing to multiple | |||
storage device addresses. Storage device level multipathing is used | storage device addresses. Storage device level multipathing is used | |||
for bandwidth scaling via trunking and for higher availability of use | for bandwidth scaling via trunking and for higher availability of use | |||
in the event of a storage device failure. Multipathing allows the | in the event of a storage device failure. Multipathing allows the | |||
client to switch to another storage device address which may be that | client to switch to another storage device address which may be that | |||
of another storage device that is exporting the same data stripe | of another storage device that is exporting the same data stripe | |||
unit, without having to contact the metadata server for a new layout. | unit, without having to contact the metadata server for a new layout. | |||
skipping to change at page 18, line 4 ¶ | skipping to change at page 17, line 33 ¶ | |||
}; | }; | |||
struct layout4 { | struct layout4 { | |||
offset4 lo_offset; | offset4 lo_offset; | |||
length4 lo_length; | length4 lo_length; | |||
layoutiomode4 lo_iomode; | layoutiomode4 lo_iomode; | |||
layout_content4 lo_content; | layout_content4 lo_content; | |||
}; | }; | |||
<CODE ENDS> | <CODE ENDS> | |||
This document defines structure associated with the layouttype4 value | ||||
LAYOUT4_FLEX_FILES. [RFC5661] specifies the loc_body structure as an | This document defines structures associated with the layouttype4 | |||
XDR type "opaque". The opaque layout is uninterpreted by the generic | value LAYOUT4_FLEX_FILES. [RFC5661] specifies the loc_body structure | |||
pNFS client layers, but is interpreted by the flexible file layout | as an XDR type "opaque". The opaque layout is uninterpreted by the | |||
type implementation. This section defines the structure of this | generic pNFS client layers, but is interpreted by the flexible file | |||
otherwise opaque value, ff_layout4. | layout type implementation. This section defines the structure of | |||
this otherwise opaque value, ff_layout4. | ||||
5.1. ff_layout4 | 5.1. ff_layout4 | |||
<CODE BEGINS> | <CODE BEGINS> | |||
/// const FF_FLAGS_NO_LAYOUTCOMMIT = 0x00000001; | /// const FF_FLAGS_NO_LAYOUTCOMMIT = 0x00000001; | |||
/// const FF_FLAGS_NO_IO_THRU_MDS = 0x00000002; | /// const FF_FLAGS_NO_IO_THRU_MDS = 0x00000002; | |||
/// const FF_FLAGS_NO_READ_IO = 0x00000004; | /// const FF_FLAGS_NO_READ_IO = 0x00000004; | |||
/// const FF_FLAGS_WRITE_ONE_MIRROR = 0x00000008; | /// const FF_FLAGS_WRITE_ONE_MIRROR = 0x00000008; | |||
skipping to change at page 20, line 41 ¶ | skipping to change at page 20, line 23 ¶ | |||
NFSv4.x storage protocols: | NFSv4.x storage protocols: | |||
loosely coupled: the stateid has to be an anonymous stateid, | loosely coupled: the stateid has to be an anonymous stateid, | |||
tightly coupled: the stateid has to be a global stateid. | tightly coupled: the stateid has to be a global stateid. | |||
These stem from a mismatch of ffds_stateid being a singleton and | These stem from a mismatch of ffds_stateid being a singleton and | |||
ffds_fh_vers being an array - each open file on the storage device | ffds_fh_vers being an array - each open file on the storage device | |||
might need an open stateid. As there are established loosely coupled | might need an open stateid. As there are established loosely coupled | |||
implementations of this version of the protocol, it can not be fixed. | implementations of this version of the protocol, it can not be fixed. | |||
If an implementation needs a different statedid per file handle, then | If an implementation needs a different stateid per file handle, then | |||
this issue will require a new version of the protocol. | this issue will require a new version of the protocol. | |||
For loosely coupled storage devices, ffds_user and ffds_group provide | For loosely coupled storage devices, ffds_user and ffds_group provide | |||
the synthetic user and group to be used in the RPC credentials that | the synthetic user and group to be used in the RPC credentials that | |||
the client presents to the storage device to access the data files. | the client presents to the storage device to access the data files. | |||
For tightly coupled storage devices, the user and group on the | For tightly coupled storage devices, the user and group on the | |||
storage device will be the same as on the metadata server. I.e., if | storage device will be the same as on the metadata server. I.e., if | |||
ffdv_tightly_coupled (see Section 4.1) is set, then the client MUST | ffdv_tightly_coupled (see Section 4.1) is set, then the client MUST | |||
ignore both ffds_user and ffds_group. | ignore both ffds_user and ffds_group. | |||
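As a rough sketch of the credential rule above: the function and the mds_user/mds_group parameters are hypothetical names, standing for whatever identity the client already uses with the metadata server. Only ffdv_tightly_coupled, ffds_user, and ffds_group come from the text.

```python
def storage_device_identity(ffdv_tightly_coupled, ffds_user, ffds_group,
                            mds_user, mds_group):
    """Pick the (user, group) to present to the storage device.

    Tightly coupled: ffds_user and ffds_group MUST be ignored, so the
    identity used with the metadata server is reused.
    Loosely coupled: the synthetic user and group from the layout are
    placed in the RPC credentials.
    """
    if ffdv_tightly_coupled:
        return (mds_user, mds_group)
    return (ffds_user, ffds_group)
```

Keeping the decision in one place makes it harder for a loosely coupled code path to leak the synthetic identity into a tightly coupled one.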
skipping to change at page 21, line 31 ¶ | skipping to change at page 21, line 14 ¶ | |||
ffl_flags is a bitmap that allows the metadata server to inform the | ffl_flags is a bitmap that allows the metadata server to inform the | |||
client of particular conditions that may result from the more or less | client of particular conditions that may result from the more or less | |||
tight coupling of the storage devices. | tight coupling of the storage devices. | |||
FF_FLAGS_NO_LAYOUTCOMMIT: can be set to indicate that the client is | FF_FLAGS_NO_LAYOUTCOMMIT: can be set to indicate that the client is | |||
not required to send LAYOUTCOMMIT to the metadata server. | not required to send LAYOUTCOMMIT to the metadata server. | |||
FF_FLAGS_NO_IO_THRU_MDS: can be set to indicate that the client | FF_FLAGS_NO_IO_THRU_MDS: can be set to indicate that the client | |||
should not send I/O operations to the metadata server. I.e., even | should not send I/O operations to the metadata server. I.e., even | |||
if the client could determine that there was a network diconnect | if the client could determine that there was a network disconnect | |||
to a storage device, the client should not try to proxy the I/O | to a storage device, the client should not try to proxy the I/O | |||
through the metadata server. | through the metadata server. | |||
FF_FLAGS_NO_READ_IO: can be set to indicate that the client should | FF_FLAGS_NO_READ_IO: can be set to indicate that the client should | |||
not send READ requests with the layouts of iomode | not send READ requests with the layouts of iomode | |||
LAYOUTIOMODE4_RW. Instead, it should request a layout of iomode | LAYOUTIOMODE4_RW. Instead, it should request a layout of iomode | |||
LAYOUTIOMODE4_READ from the metadata server. | LAYOUTIOMODE4_READ from the metadata server. | |||
FF_FLAGS_WRITE_ONE_MIRROR: can be set to indicate that the client | FF_FLAGS_WRITE_ONE_MIRROR: can be set to indicate that the client | |||
only needs to update one of the mirrors (see Section 8.2). | only needs to update one of the mirrors (see Section 8.2). | |||
5.1.1. Error Codes from LAYOUTGET | 5.1.1. Error Codes from LAYOUTGET | |||
[RFC5661] provides little guidance as to how the client is to proceed | [RFC5661] provides little guidance as to how the client is to proceed | |||
with a LAYOUTEGT which returns an error of either | with a LAYOUTGET which returns an error of either | |||
NFS4ERR_LAYOUTTRYLATER, NFS4ERR_LAYOUTUNAVAILABLE, or NFS4ERR_DELAY. | NFS4ERR_LAYOUTTRYLATER, NFS4ERR_LAYOUTUNAVAILABLE, or NFS4ERR_DELAY. | |||
Within the context of this document: | Within the context of this document: | |||
NFS4ERR_LAYOUTUNAVAILABLE: there is no layout available and the I/O | NFS4ERR_LAYOUTUNAVAILABLE: there is no layout available and the I/O | |||
is to go to the metadata server. Note that it is possible to have | is to go to the metadata server. Note that it is possible to have | |||
had a layout before a recall and not after. | had a layout before a recall and not after. | |||
NFS4ERR_LAYOUTTRYLATER: there is some issue preventing the layout | NFS4ERR_LAYOUTTRYLATER: there is some issue preventing the layout | |||
from being granted. If the client already has an appropriate | from being granted. If the client already has an appropriate | |||
layout, it should continue with I/O to the storage devices. | layout, it should continue with I/O to the storage devices. | |||
skipping to change at page 24, line 41 ¶ | skipping to change at page 24, line 24 ¶ | |||
compression or deduplication on the data. However, the over the wire | compression or deduplication on the data. However, the over the wire | |||
transfer of the file contents MUST appear identical. Note, this is a | transfer of the file contents MUST appear identical. Note, this is a | |||
constraint of the selected XDR representation in which each mirrored | constraint of the selected XDR representation in which each mirrored | |||
copy of the layout segment has the same striping pattern (see | copy of the layout segment has the same striping pattern (see | |||
Figure 1). | Figure 1). | |||
The metadata server is responsible for determining the number of | The metadata server is responsible for determining the number of | |||
mirrored copies and the location of each mirror. While the client | mirrored copies and the location of each mirror. While the client | |||
may provide a hint to how many copies it wants (see Section 12), the | may provide a hint to how many copies it wants (see Section 12), the | |||
metadata server can ignore that hint and in any event, the client has | metadata server can ignore that hint and in any event, the client has | |||
no means to dictate neither the storage device (which also means the | no means to dictate either the storage device (which also means the | |||
coupling and/or protocol levels to access the layout segments) nor | coupling and/or protocol levels to access the layout segments) or the | |||
the location of said storage device. | location of said storage device. | |||
The updating of mirrored layout segments is done via client-side | The updating of mirrored layout segments is done via client-side | |||
mirroring. With this approach, the client is responsible for making | mirroring. With this approach, the client is responsible for making | |||
sure modifications are made on all copies of the layout segments it | sure modifications are made on all copies of the layout segments it | |||
is informed of via the layout. If a layout segment is being | is informed of via the layout. If a layout segment is being | |||
resilvered to a storage device, that mirrored copy will not be in the | resilvered to a storage device, that mirrored copy will not be in the | |||
layout. Thus the metadata server MUST update that copy until the | layout. Thus the metadata server MUST update that copy until the | |||
client is presented with it in a layout. If the FF_FLAGS_WRITE_ONE_MIRROR | client is presented with it in a layout. If the FF_FLAGS_WRITE_ONE_MIRROR | |||
is set in ffl_flags, the client need only update one of the mirrors | is set in ffl_flags, the client need only update one of the mirrors | |||
(see Section 8.2. If the client is writing to the layout segments | (see Section 8.2). If the client is writing to the layout segments | |||
via the metadata server, then the metadata server MUST update all | via the metadata server, then the metadata server MUST update all | |||
copies of the mirror. As seen in Section 8.3, during the | copies of the mirror. As seen in Section 8.3, during the | |||
resilvering, the layout is recalled, and the client has to make | resilvering, the layout is recalled, and the client has to make | |||
modifications via the metadata server. | modifications via the metadata server. | |||
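The client-side rule above for which mirrors receive a write can be sketched as follows. The function name, the list representation of mirrors, and the pick callback are hypothetical. How a client chooses the single mirror is left to the caller here.

```python
FF_FLAGS_WRITE_ONE_MIRROR = 0x00000008  # value from the ff_layout4 XDR


def mirrors_to_write(ffl_flags, mirrors, pick=lambda ms: ms[0]):
    """Return the mirrors the client itself must update.

    `mirrors` is the list of mirrored copies named in the layout.
    `pick` is a caller-supplied choice function (an assumption of this
    sketch); by default the first mirror is used.
    """
    if not mirrors:
        raise ValueError("layout contains no mirrors")
    if ffl_flags & FF_FLAGS_WRITE_ONE_MIRROR:
        # Only one copy needs to be updated by the client.
        return [pick(mirrors)]
    # Otherwise the client updates every copy it was given.
    return list(mirrors)
```

Copies being resilvered are not in the layout, so they never appear in `mirrors`; updating them remains the metadata server's job.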
8.1. Selecting a Mirror | 8.1. Selecting a Mirror | |||
When the metadata server grants a layout to a client, it MAY let the | When the metadata server grants a layout to a client, it MAY let the | |||
client know how fast it expects each mirror to be once the request | client know how fast it expects each mirror to be once the request | |||
arrives at the storage devices via the ffds_efficiency member. While | arrives at the storage devices via the ffds_efficiency member. While | |||
skipping to change at page 25, line 29 ¶ | skipping to change at page 25, line 14 ¶ | |||
However, what should not be involved in that calculation is a | However, what should not be involved in that calculation is a | |||
perceived network distance between the client and the storage device. | perceived network distance between the client and the storage device. | |||
The client is better situated for making that determination based on | The client is better situated for making that determination based on | |||
past interaction with the storage device over the different available | past interaction with the storage device over the different available | |||
network interfaces between the two. I.e., the metadata server might | network interfaces between the two. I.e., the metadata server might | |||
not know about a transient outage between the client and storage | not know about a transient outage between the client and storage | |||
device because it has no presence on the given subnet. | device because it has no presence on the given subnet. | |||
As such, it is the client which decides which mirror to access for | As such, it is the client which decides which mirror to access for | |||
reading the file. The requirements for writing to a mirrored layout | reading the file. The requirements for writing to mirrored layout | |||
segments are presented below. | segments are presented below. | |||
8.2. Writing to Mirrors | 8.2. Writing to Mirrors | |||
8.2.1. Single Storage Device Updates Mirrors | 8.2.1. Single Storage Device Updates Mirrors | |||
If the FF_FLAGS_WRITE_ONE_MIRROR flag in ffl_flags is set, the client | If the FF_FLAGS_WRITE_ONE_MIRROR flag in ffl_flags is set, the client | |||
only needs to update one of the copies of the layout segment. For | only needs to update one of the copies of the layout segment. For | |||
this case, the storage device MUST ensure that all copies of the | this case, the storage device MUST ensure that all copies of the | |||
mirror are updated when any one of the mirrors is updated. If the | mirror are updated when any one of the mirrors is updated. If the | |||
storage device gets an error when updating one of the mirrors, then | storage device gets an error when updating one of the mirrors, then | |||
it MUST inform the client that the original WRITE had an error. The | it MUST inform the client that the original WRITE had an error. The | |||
client then MUST inform the metadata server (see Section 8.2.3. The | client then MUST inform the metadata server (see Section 8.2.3). The | |||
client's responsibility with resepect to COMMIT is explained in | client's responsibility with respect to COMMIT is explained in | |||
Section 8.2.4. The client may choose any one of the mirrors and may | Section 8.2.4. The client may choose any one of the mirrors and may | |||
use ffds_efficiency in the same manner as for reading when making | use ffds_efficiency in the same manner as for reading when making | |||
this choice. | this choice. | |||
8.2.2. Client Updates All Mirrors | 8.2.2. Client Updates All Mirrors | |||
If the FF_FLAGS_WRITE_ONE_MIRROR flag in ffl_flags is not set, the | If the FF_FLAGS_WRITE_ONE_MIRROR flag in ffl_flags is not set, the | |||
client is responsible for updating all mirrored copies of the layout | client is responsible for updating all mirrored copies of the layout | |||
segments that it is given in the layout. A single failed update is | segments that it is given in the layout. A single failed update is | |||
sufficient to fail the entire operation. If all but one copy is | sufficient to fail the entire operation. If all but one copy is | |||
skipping to change at page 27, line 16 ¶ | skipping to change at page 26, line 40 ¶ | |||
When stable writes are done to the metadata server or to a single | When stable writes are done to the metadata server or to a single | |||
replica (if allowed by the use of FF_FLAGS_WRITE_ONE_MIRROR), it is | replica (if allowed by the use of FF_FLAGS_WRITE_ONE_MIRROR), it is | |||
the responsibility of the receiving node to propagate the written | the responsibility of the receiving node to propagate the written | |||
data stably, before replying to the client. | data stably, before replying to the client. | |||
In the corresponding cases in which unstable writes are done, the | In the corresponding cases in which unstable writes are done, the | |||
receiving node does not have any such obligation, although it may | receiving node does not have any such obligation, although it may | |||
choose to asynchronously propagate the updates. However, once a | choose to asynchronously propagate the updates. However, once a | |||
COMMIT is replied to, all replicas must reflect the writes that have | COMMIT is replied to, all replicas must reflect the writes that have | |||
been done, and these data must have been committed to stable storage | been done, and this data must have been committed to stable storage | |||
on all replicas. | on all replicas. | |||
In order to avoid situations in which stale data is read from | In order to avoid situations in which stale data is read from | |||
replicas to which writes have not been propagated: | replicas to which writes have not been propagated: | |||
o A client which has outstanding unstable writes made to a single node | o A client which has outstanding unstable writes made to a single node | |||
(metadata server or storage device) MUST do all reads from that | (metadata server or storage device) MUST do all reads from that | |||
same node. | same node. | |||
o When writes are flushed to the server, for example to implement, | o When writes are flushed to the server, for example to implement, | |||
skipping to change at page 30, line 22 ¶ | skipping to change at page 29, line 47 ¶ | |||
/// uint64_t ffil_bytes_completed; | /// uint64_t ffil_bytes_completed; | |||
/// uint64_t ffil_bytes_not_delivered; | /// uint64_t ffil_bytes_not_delivered; | |||
/// nfstime4 ffil_total_busy_time; | /// nfstime4 ffil_total_busy_time; | |||
/// nfstime4 ffil_aggregate_completion_time; | /// nfstime4 ffil_aggregate_completion_time; | |||
/// }; | /// }; | |||
/// | /// | |||
<CODE ENDS> | <CODE ENDS> | |||
Both operation counts and bytes transferred are kept in the | Both operation counts and bytes transferred are kept in the | |||
ff_io_latency4. READ operations are used for read latencies. Both | ff_io_latency4. As seen in ff_layoutupdate4 (See Section 9.2.2) read | |||
WRITE and COMMIT operations are used for write latencies. | and write operations are aggregated separately. READ operations are | |||
"Requested" counters track what the client is attempting to do and | used for the ff_io_latency4 ffl_read. Both WRITE and COMMIT | |||
"completed" counters track what was done. Note that there is no | operations are used for the ff_io_latency4 ffl_write. "Requested" | |||
requirement that the client only report completed results that have | counters track what the client is attempting to do and "completed" | |||
matching requested results from the reported period. | counters track what was done. There is no requirement that the | |||
client only report completed results that have matching requested | ||||
results from the reported period. | ||||
ffil_bytes_not_delivered is used to track the aggregate number of | ffil_bytes_not_delivered is used to track the aggregate number of | |||
bytes requested but not fulfilled due to error conditions. | bytes requested but not fulfilled due to error conditions. | |||
ffil_total_busy_time is the aggregate time spent with outstanding RPC | ffil_total_busy_time is the aggregate time spent with outstanding RPC | |||
calls, ffil_aggregate_completion_time is the sum of all latencies for | calls. ffil_aggregate_completion_time is the sum of all round trip | |||
completed RPC calls. | times for completed RPC calls. | |||
In Section 3.3.1 of [RFC5661], the nfstime4 is defined as the number | ||||
of seconds and nanoseconds since midnight or zero hour January 1, | ||||
1970 Coordinated Universal Time (UTC). The use of nfstime4 in | ||||
ff_io_latency4 is to store time since the start of the first I/O from | ||||
the client after receiving the layout. In other words, these are to | ||||
be decoded as duration and not as a date and time. | ||||
Note that LAYOUTSTATS are cumulative, i.e., not reset each time the | Note that LAYOUTSTATS are cumulative, i.e., not reset each time the | |||
operation is sent. If two LAYOUTSTATS ops for the same file, layout | operation is sent. If two LAYOUTSTATS ops for the same file, layout | |||
stateid, and originating from the same NFS client are processed at | stateid, and originating from the same NFS client are processed at | |||
the same time by the metadata server, then the one containing the | the same time by the metadata server, then the one containing the | |||
larger values contains the most recent time series data. | larger values contains the most recent time series data. | |||
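Two points above lend themselves to a short sketch: an nfstime4 in ff_io_latency4 is read as a duration, and of two cumulative reports the one with the larger values is the more recent. The class and function names are hypothetical, and the report is reduced to two fields for brevity. The seconds/nseconds split follows the nfstime4 definition in [RFC5661].

```python
from dataclasses import dataclass

NSEC_PER_SEC = 1_000_000_000


@dataclass(frozen=True)
class NfsTime4:
    seconds: int
    nseconds: int

    def as_duration_ns(self):
        """Interpret the value as an elapsed time, not a date."""
        return self.seconds * NSEC_PER_SEC + self.nseconds


@dataclass(frozen=True)
class IoLatencyReport:
    # Reduced, hypothetical view of ff_io_latency4.
    bytes_completed: int
    total_busy_time: NfsTime4


def most_recent(a, b):
    """Counters are cumulative, so the larger report is the newer one."""
    def key(r):
        return (r.bytes_completed, r.total_busy_time.as_duration_ns())
    return a if key(a) >= key(b) else b
```

A metadata server that processes two LAYOUTSTATS concurrently could use such a comparison to decide which report to keep.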
9.2.2. ff_layoutupdate4 | 9.2.2. ff_layoutupdate4 | |||
<CODE BEGINS> | <CODE BEGINS> | |||
skipping to change at page 36, line 15 ¶ | skipping to change at page 35, line 44 ¶ | |||
[RFC5661]). To avoid data corruption, the metadata server MUST fence | [RFC5661]). To avoid data corruption, the metadata server MUST fence | |||
off the revoked clients from the respective data files as described | off the revoked clients from the respective data files as described | |||
in Section 2.2. | in Section 2.2. | |||
15. Security Considerations | 15. Security Considerations | |||
The pNFS extension partitions the NFSv4.1+ file system protocol into | The pNFS extension partitions the NFSv4.1+ file system protocol into | |||
two parts, the control path and the data path (storage protocol). | two parts, the control path and the data path (storage protocol). | |||
The control path contains all the new operations described by this | The control path contains all the new operations described by this | |||
extension; all existing NFSv4 security mechanisms and features apply | extension; all existing NFSv4 security mechanisms and features apply | |||
to the control path. The combination of components in a pNFS system | to the control path (see Sections 1.7.1 and 2.2.1 of [RFC5661]). The | |||
is required to preserve the security properties of NFSv4.1+ with | combination of components in a pNFS system is required to preserve | |||
respect to an entity accessing data via a client, including security | the security properties of NFSv4.1+ with respect to an entity | |||
countermeasures to defend against threats that NFSv4.1+ provides | accessing data via a client, including security countermeasures to | |||
defenses for in environments where these threats are considered | defend against threats that NFSv4.1+ provides defenses for in | |||
significant. | environments where these threats are considered significant. | |||
The metadata server enforces the file access-control policy at | The metadata server enforces the file access-control policy at | |||
LAYOUTGET time. The client should use RPC authorization credentials | LAYOUTGET time. The client should use RPC authorization credentials | |||
(uid/gid for AUTH_SYS or tickets for Kerberos) for getting the layout | for getting the layout for the requested iomode (READ or RW) and the | |||
for the requested iomode (READ or RW) and the server verifies the | server verifies the permissions and ACL for these credentials, | |||
permissions and ACL for these credentials, possibly returning | possibly returning NFS4ERR_ACCESS if the client is not allowed the | |||
NFS4ERR_ACCESS if the client is not allowed the requested iomode. If | requested iomode. If the LAYOUTGET operation succeeds the client | |||
the LAYOUTGET operation succeeds the client receives, as part of the | receives, as part of the layout, a set of credentials allowing it I/O | |||
layout, a set of credentials allowing it I/O access to the specified | access to the specified data files corresponding to the requested | |||
data files corresponding to the requested iomode. When the client | iomode. When the client acts on I/O operations on behalf of its | |||
acts on I/O operations on behalf of its local users, it MUST | local users, it MUST authenticate and authorize the user by issuing | |||
authenticate and authorize the user by issuing respective OPEN and | respective OPEN and ACCESS calls to the metadata server, similar to | |||
ACCESS calls to the metadata server, similar to having NFSv4 data | having NFSv4 data delegations. | |||
delegations. | ||||
If access is allowed, the client uses the corresponding (READ or RW) | If access is allowed, the client uses the corresponding (READ or RW) | |||
credentials to perform the I/O operations at the data file's storage | credentials to perform the I/O operations at the data file's storage | |||
devices. When the metadata server receives a request to change a | devices. When the metadata server receives a request to change a | |||
file's permissions or ACL, it SHOULD recall all layouts for that file | file's permissions or ACL, it SHOULD recall all layouts for that file | |||
and then MUST fence off any clients still holding outstanding layouts | and then MUST fence off any clients still holding outstanding layouts | |||
for the respective files by implicitly invalidating the previously | for the respective files by implicitly invalidating the previously | |||
distributed credential on all data files comprising the file in | distributed credential on all data files comprising the file in | |||
question. It is REQUIRED that this be done before committing to the | question. It is REQUIRED that this be done before committing to the | |||
new permissions and/or ACL. By requesting new layouts, the clients | new permissions and/or ACL. By requesting new layouts, the clients | |||
will reauthorize access against the modified access control metadata. | will reauthorize access against the modified access control metadata. | |||
Recalling the layouts in this case is intended to prevent clients | Recalling the layouts in this case is intended to prevent clients | |||
from getting an error on I/Os done after the client was fenced off. | from getting an error on I/Os done after the client was fenced off. | |||
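The ordering required above when permissions or an ACL change (recall, then fence, then commit) can be sketched as follows. All names are hypothetical, and the three callbacks stand in for whatever mechanisms a metadata server actually uses.

```python
def apply_permission_change(file_id, new_acl,
                            recall_layouts, fence_client, commit_acl):
    """Apply a permission/ACL change in the order the text requires.

    recall_layouts(file_id) -> iterable of clients still holding a
        layout after the recall (the SHOULD step).
    fence_client(client, file_id) invalidates the credential that was
        distributed to that client (the MUST step).
    commit_acl(file_id, new_acl) makes the new permissions effective;
        it runs only after fencing is complete.
    """
    still_holding = recall_layouts(file_id)
    for client in still_holding:
        fence_client(client, file_id)
    commit_acl(file_id, new_acl)
```

Clients that returned their layouts during the recall are never fenced, which is the point of recalling first: they avoid I/O errors and simply request a new layout.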
15.1. Kerberized File Access | 15.1. RPCSEC_GSS and Security Services | |||
15.1.1. Loosely Coupled | 15.1.1. Loosely Coupled | |||
RPCSEC_GSS version 3 (RPCSEC_GSSv3) [RFC7861] could be used to | RPCSEC_GSS version 3 (RPCSEC_GSSv3) [RFC7861] could be used to | |||
authorize the client to the storage device on behalf of the metadata | authorize the client to the storage device on behalf of the metadata | |||
server. This would require that each of the metadata server, storage | server. This would require that each of the metadata server, storage | |||
device, and client would have to implement RPCSEC_GSSv3. The second | device, and client would have to implement RPCSEC_GSSv3 via an RPC- | |||
requirement does not match the intent of the loosely coupled model | application-defined structured privilege assertion in a manner | |||
that the storage device need not be modified. | described in Section 4.9.1 of [RFC7862]. These requirements do not | |||
match the intent of the loosely coupled model that the storage device | ||||
Under this coupling model, the principal used to authenticate the | need not be modified. (Note that this does not preclude the use of | |||
metadata file is different than that used to authenticate the data | RPCSEC_GSSv3 in a loosely coupled model.) | |||
file. For the metadata server, the user credentials would be | ||||
generated by the same Kerberos server as the client. However, for | ||||
the data storage access, the metadata server would generate the | ||||
ticket granting tickets and provide them to the client. Fencing | ||||
would then be controlled either by expiring the ticket or by | ||||
modifying the syntethic uid or gid on the data file. | ||||
15.1.2. Tightly Coupled | 15.1.2. Tightly Coupled | |||
With tight coupling, the principal used to access the metadata file | With tight coupling, the principal used to access the metadata file | |||
is exactly the same as used to access the data file. As a result | is exactly the same as used to access the data file. The storage | |||
there are no security issues related to using Kerberos with a tightly | device can use the control protocol to validate any RPC credentials. | |||
coupled system. | As a result there are no security issues related to using RPCSEC_GSS | |||
with a tightly coupled system. For example, if Kerberos V5 GSS-API | ||||
[RFC4121] is used as the security mechanism, then the storage device | ||||
could use a control protocol to validate the RPC credentials to the | ||||
metadata server. | ||||
16. IANA Considerations | 16. IANA Considerations | |||
[RFC5661] introduced a registry for "pNFS Layout Types Registry" and | [RFC5661] introduced a registry for "pNFS Layout Types Registry" and | |||
as such, new layout type numbers need to be assigned by IANA. This | as such, new layout type numbers need to be assigned by IANA. This | |||
document defines the protocol associated with the existing layout | document defines the protocol associated with the existing layout | |||
type number, LAYOUT4_FLEX_FILES (see Table 1). | type number, LAYOUT4_FLEX_FILES (see Table 1). | |||
+--------------------+-------+----------+-----+----------------+ | +--------------------+-------+----------+-----+----------------+ | |||
| Layout Type Name | Value | RFC | How | Minor Versions | | | Layout Type Name | Value | RFC | How | Minor Versions | | |||
skipping to change at page 38, line 43 ¶ | skipping to change at page 38, line 19 ¶ | |||
[LEGAL] IETF Trust, "Legal Provisions Relating to IETF Documents", | [LEGAL] IETF Trust, "Legal Provisions Relating to IETF Documents", | |||
November 2008, <http://trustee.ietf.org/docs/ | November 2008, <http://trustee.ietf.org/docs/ | |||
IETF-Trust-License-Policy.pdf>. | IETF-Trust-License-Policy.pdf>. | |||
[RFC1813] IETF, "NFS Version 3 Protocol Specification", RFC 1813, | [RFC1813] IETF, "NFS Version 3 Protocol Specification", RFC 1813, | |||
June 1995. | June 1995. | |||
[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate | [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate | |||
Requirement Levels", BCP 14, RFC 2119, March 1997. | Requirement Levels", BCP 14, RFC 2119, March 1997. | |||
[RFC4121] Zhu, L., Jaganathan, K., and S. Hartman, "The Kerberos | ||||
Version 5 Generic Security Service Application Program | ||||
Interface (GSS-API) Mechanism Version 2", RFC 4121, July | ||||
2005. | ||||
[RFC4506] Eisler, M., "XDR: External Data Representation Standard", | [RFC4506] Eisler, M., "XDR: External Data Representation Standard", | |||
STD 67, RFC 4506, May 2006. | STD 67, RFC 4506, May 2006. | |||
[RFC5531] Thurlow, R., "RPC: Remote Procedure Call Protocol | [RFC5531] Thurlow, R., "RPC: Remote Procedure Call Protocol | |||
Specification Version 2", RFC 5531, May 2009. | Specification Version 2", RFC 5531, May 2009. | |||
[RFC5661] Shepler, S., Ed., Eisler, M., Ed., and D. Noveck, Ed., | [RFC5661] Shepler, S., Ed., Eisler, M., Ed., and D. Noveck, Ed., | |||
"Network File System (NFS) Version 4 Minor Version 1 | "Network File System (NFS) Version 4 Minor Version 1 | |||
Protocol", RFC 5661, January 2010. | Protocol", RFC 5661, January 2010. | |||
skipping to change at page 39, line 18 ¶ | skipping to change at page 38, line 47 ¶ | |||
RFC 5662, January 2010. | RFC 5662, January 2010. | |||
[RFC7530] Haynes, T. and D. Noveck, "Network File System (NFS) | [RFC7530] Haynes, T. and D. Noveck, "Network File System (NFS) | |||
version 4 Protocol", RFC 7530, March 2015. | version 4 Protocol", RFC 7530, March 2015. | |||
[RFC7862] Haynes, T., "NFS Version 4 Minor Version 2", RFC 7862, | [RFC7862] Haynes, T., "NFS Version 4 Minor Version 2", RFC 7862, | |||
November 2016. | November 2016. | |||
[pNFSLayouts] | [pNFSLayouts] | |||
Haynes, T., "Requirements for pNFS Layout Types", draft- | Haynes, T., "Requirements for pNFS Layout Types", draft- | |||
ietf-nfsv4-layout-types-04 (Work In Progress), January | ietf-nfsv4-layout-types-05 (Work In Progress), July 2017. | |||
2016. | ||||
17.2. Informative References | 17.2. Informative References | |||
[RFC4519] Sciberras, A., Ed., "Lightweight Directory Access Protocol | [RFC4519] Sciberras, A., Ed., "Lightweight Directory Access Protocol | |||
(LDAP): Schema for User Applications", RFC 4519, DOI | (LDAP): Schema for User Applications", RFC 4519, DOI | |||
10.17487/RFC4519, June 2006, | 10.17487/RFC4519, June 2006, | |||
<http://www.rfc-editor.org/info/rfc4519>. | <http://www.rfc-editor.org/info/rfc4519>. | |||
[RFC7861] Adamson, W. and N. Williams, "Remote Procedure Call (RPC) | [RFC7861] Adamson, W. and N. Williams, "Remote Procedure Call (RPC) | |||
Security Version 3", RFC 7861, November 2016. | Security Version 3", RFC 7861, November 2016. | |||
Appendix A. Acknowledgments | Appendix A. Acknowledgments | |||
Those who provided miscellaneous comments to early drafts of this | Those who provided miscellaneous comments to early drafts of this | |||
document include: Matt W. Benjamin, Adam Emerson, J. Bruce Fields, | document include: Matt W. Benjamin, Adam Emerson, J. Bruce Fields, | |||
and Lev Solomonov. | and Lev Solomonov. | |||
Those who provided miscellaneous comments to the final drafts of this | Those who provided miscellaneous comments to the final drafts of this | |||
document include: Anand Ganesh, Robert Wipfel, Gobikrishnan | document include: Anand Ganesh, Robert Wipfel, Gobikrishnan | |||
Sundharraj, Trond Myklebust, and Rick Macklem. | Sundharraj, Trond Myklebust, Rick Macklem, and Jim Sermersheim. | |||
Idan Kedar caught a nasty bug in the interaction of client side | Idan Kedar caught a nasty bug in the interaction of client side | |||
mirroring and the minor versioning of devices. | mirroring and the minor versioning of devices. | |||
Dave Noveck provided comprehensive reviews of the document during the | Dave Noveck provided comprehensive reviews of the document during the | |||
working group last calls. He also rewrote Section 2.3. | working group last calls. He also rewrote Section 2.3. | |||
Olga Kornievskaiaa made a convincing case against the use of a | Olga Kornievskaia made a convincing case against the use of a | |||
credential versus a principal in the fencing approach. Andy Adamson | credential versus a principal in the fencing approach. Andy Adamson | |||
and Benjamin Kaduk helped to sharpen the focus. | and Benjamin Kaduk helped to sharpen the focus. | |||
Benjamin Kaduk and Olga Kornievskaia also helped provide concrete | ||||
scenarios for loosely coupled security mechanisms. And in the end, | ||||
Olga proved that as defined, the loosely coupled model would not work | ||||
with RPCSEC_GSS. | ||||
Tigran Mkrtchyan provided the use case for not allowing the client to | Tigran Mkrtchyan provided the use case for not allowing the client to | |||
proxy the I/O through the data server. | proxy the I/O through the data server. | |||
Rick Macklem provided the use case for only writing to a single | Rick Macklem provided the use case for only writing to a single | |||
mirror. | mirror. | |||
Appendix B. RFC Editor Notes | Appendix B. RFC Editor Notes | |||
[RFC Editor: please remove this section prior to publishing this | [RFC Editor: please remove this section prior to publishing this | |||
document as an RFC] | document as an RFC] | |||
End of changes. 50 change blocks. | ||||
134 lines changed or deleted | 133 lines changed or added | |||