draft-ietf-nfsv4-flex-files-09.txt | draft-ietf-nfsv4-flex-files-10.txt | |||
---|---|---|---|---|
NFSv4 B. Halevy | NFSv4 B. Halevy | |||
Internet-Draft | Internet-Draft | |||
Intended status: Standards Track T. Haynes | Intended status: Standards Track T. Haynes | |||
Expires: November 10, 2017 Primary Data | Expires: January 18, 2018 Primary Data | |||
May 09, 2017 | July 17, 2017 | |||
Parallel NFS (pNFS) Flexible File Layout | Parallel NFS (pNFS) Flexible File Layout | |||
draft-ietf-nfsv4-flex-files-09.txt | draft-ietf-nfsv4-flex-files-10.txt | |||
Abstract | Abstract | |||
The Parallel Network File System (pNFS) allows a separation between | The Parallel Network File System (pNFS) allows a separation between | |||
the metadata (onto a metadata server) and data (onto a storage | the metadata (onto a metadata server) and data (onto a storage | |||
device) for a file. The flexible file layout type is defined in this | device) for a file. The flexible file layout type is defined in this | |||
document as an extension to pNFS which allows the use of storage | document as an extension to pNFS which allows the use of storage | |||
devices in a fashion such that they require only a quite limited | devices in a fashion such that they require only a quite limited | |||
degree of interaction with the metadata server, using already | degree of interaction with the metadata server, using already | |||
existing protocols. Client side mirroring is also added to provide | existing protocols. Client side mirroring is also added to provide | |||
skipping to change at page 1, line 38 ¶ | skipping to change at page 1, line 38 ¶ | |||
Internet-Drafts are working documents of the Internet Engineering | Internet-Drafts are working documents of the Internet Engineering | |||
Task Force (IETF). Note that other groups may also distribute | Task Force (IETF). Note that other groups may also distribute | |||
working documents as Internet-Drafts. The list of current Internet- | working documents as Internet-Drafts. The list of current Internet- | |||
Drafts is at http://datatracker.ietf.org/drafts/current/. | Drafts is at http://datatracker.ietf.org/drafts/current/. | |||
Internet-Drafts are draft documents valid for a maximum of six months | Internet-Drafts are draft documents valid for a maximum of six months | |||
and may be updated, replaced, or obsoleted by other documents at any | and may be updated, replaced, or obsoleted by other documents at any | |||
time. It is inappropriate to use Internet-Drafts as reference | time. It is inappropriate to use Internet-Drafts as reference | |||
material or to cite them other than as "work in progress." | material or to cite them other than as "work in progress." | |||
This Internet-Draft will expire on November 10, 2017. | This Internet-Draft will expire on January 18, 2018. | |||
Copyright Notice | Copyright Notice | |||
Copyright (c) 2017 IETF Trust and the persons identified as the | Copyright (c) 2017 IETF Trust and the persons identified as the | |||
document authors. All rights reserved. | document authors. All rights reserved. | |||
This document is subject to BCP 78 and the IETF Trust's Legal | This document is subject to BCP 78 and the IETF Trust's Legal | |||
Provisions Relating to IETF Documents | Provisions Relating to IETF Documents | |||
(http://trustee.ietf.org/license-info) in effect on the date of | (http://trustee.ietf.org/license-info) in effect on the date of | |||
publication of this document. Please review these documents | publication of this document. Please review these documents | |||
skipping to change at page 2, line 17 ¶ | skipping to change at page 2, line 17 ¶ | |||
described in the Simplified BSD License. | described in the Simplified BSD License. | |||
Table of Contents | Table of Contents | |||
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 3 | 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 3 | |||
1.1. Definitions . . . . . . . . . . . . . . . . . . . . . . . 3 | 1.1. Definitions . . . . . . . . . . . . . . . . . . . . . . . 3 | |||
1.2. Difference Between a Data Server and a Storage Device . . 5 | 1.2. Difference Between a Data Server and a Storage Device . . 5 | |||
1.3. Requirements Language . . . . . . . . . . . . . . . . . . 6 | 1.3. Requirements Language . . . . . . . . . . . . . . . . . . 6 | |||
2. Coupling of Storage Devices . . . . . . . . . . . . . . . . . 6 | 2. Coupling of Storage Devices . . . . . . . . . . . . . . . . . 6 | |||
2.1. LAYOUTCOMMIT . . . . . . . . . . . . . . . . . . . . . . 6 | 2.1. LAYOUTCOMMIT . . . . . . . . . . . . . . . . . . . . . . 6 | |||
2.2. Fencing Clients from the Data Server . . . . . . . . . . 6 | 2.2. Fencing Clients from the Storage Device . . . . . . . . . 6 | |||
2.2.1. Implementation Notes for Synthetic uids/gids . . . . 7 | 2.2.1. Implementation Notes for Synthetic uids/gids . . . . 7 | |||
2.2.2. Example of using Synthetic uids/gids . . . . . . . . 7 | 2.2.2. Example of using Synthetic uids/gids . . . . . . . . 8 | |||
2.3. State and Locking Models . . . . . . . . . . . . . . . . 8 | 2.3. State and Locking Models . . . . . . . . . . . . . . . . 9 | |||
3. XDR Description of the Flexible File Layout Type . . . . . . 9 | 2.3.1. Loosely Coupled Locking Model . . . . . . . . . . . . 9 | |||
3.1. Code Components Licensing Notice . . . . . . . . . . . . 10 | 2.3.2. Tightly Coupled Locking Model . . . . . . . . . . . . 10 | |||
4. Device Addressing and Discovery . . . . . . . . . . . . . . . 11 | 3. XDR Description of the Flexible File Layout Type . . . . . . 12 | |||
4.1. ff_device_addr4 . . . . . . . . . . . . . . . . . . . . . 11 | 3.1. Code Components Licensing Notice . . . . . . . . . . . . 13 | |||
4.2. Storage Device Multipathing . . . . . . . . . . . . . . . 13 | 4. Device Addressing and Discovery . . . . . . . . . . . . . . . 14 | |||
5. Flexible File Layout type . . . . . . . . . . . . . . . . . . 14 | 4.1. ff_device_addr4 . . . . . . . . . . . . . . . . . . . . . 14 | |||
5.1. ff_layout4 . . . . . . . . . . . . . . . . . . . . . . . 14 | 4.2. Storage Device Multipathing . . . . . . . . . . . . . . . 16 | |||
5.1.1. Error codes from LAYOUTGET . . . . . . . . . . . . . 18 | 5. Flexible File Layout Type . . . . . . . . . . . . . . . . . . 17 | |||
5.1.2. Client Interactions with FF_FLAGS_NO_IO_THRU_MDS . . 18 | 5.1. ff_layout4 . . . . . . . . . . . . . . . . . . . . . . . 17 | |||
5.2. Interactions Between Devices and Layouts . . . . . . . . 18 | 5.1.1. Error Codes from LAYOUTGET . . . . . . . . . . . . . 21 | |||
5.3. Handling Version Errors . . . . . . . . . . . . . . . . . 19 | 5.1.2. Client Interactions with FF_FLAGS_NO_IO_THRU_MDS . . 21 | |||
6. Striping via Sparse Mapping . . . . . . . . . . . . . . . . . 19 | 5.2. Interactions Between Devices and Layouts . . . . . . . . 22 | |||
7. Recovering from Client I/O Errors . . . . . . . . . . . . . . 20 | 5.3. Handling Version Errors . . . . . . . . . . . . . . . . . 22 | |||
8. Mirroring . . . . . . . . . . . . . . . . . . . . . . . . . . 20 | 6. Striping via Sparse Mapping . . . . . . . . . . . . . . . . . 23 | |||
8.1. Selecting a Mirror . . . . . . . . . . . . . . . . . . . 21 | 7. Recovering from Client I/O Errors . . . . . . . . . . . . . . 23 | |||
8.2. Writing to Mirrors . . . . . . . . . . . . . . . . . . . 21 | 8. Mirroring . . . . . . . . . . . . . . . . . . . . . . . . . . 24 | |||
8.3. Metadata Server Resilvering of the File . . . . . . . . . 22 | 8.1. Selecting a Mirror . . . . . . . . . . . . . . . . . . . 24 | |||
9. Flexible Files Layout Type Return . . . . . . . . . . . . . . 22 | 8.2. Writing to Mirrors . . . . . . . . . . . . . . . . . . . 25 | |||
9.1. I/O Error Reporting . . . . . . . . . . . . . . . . . . . 23 | 8.3. Metadata Server Resilvering of the File . . . . . . . . . 26 | |||
9.1.1. ff_ioerr4 . . . . . . . . . . . . . . . . . . . . . . 23 | 9. Flexible Files Layout Type Return . . . . . . . . . . . . . . 26 | |||
9.2. Layout Usage Statistics . . . . . . . . . . . . . . . . . 24 | 9.1. I/O Error Reporting . . . . . . . . . . . . . . . . . . . 27 | |||
9.2.1. ff_io_latency4 . . . . . . . . . . . . . . . . . . . 24 | 9.1.1. ff_ioerr4 . . . . . . . . . . . . . . . . . . . . . . 27 | |||
9.2.2. ff_layoutupdate4 . . . . . . . . . . . . . . . . . . 25 | 9.2. Layout Usage Statistics . . . . . . . . . . . . . . . . . 28 | |||
9.2.3. ff_iostats4 . . . . . . . . . . . . . . . . . . . . . 26 | 9.2.1. ff_io_latency4 . . . . . . . . . . . . . . . . . . . 28 | |||
9.3. ff_layoutreturn4 . . . . . . . . . . . . . . . . . . . . 27 | 9.2.2. ff_layoutupdate4 . . . . . . . . . . . . . . . . . . 29 | |||
10. Flexible Files Layout Type LAYOUTERROR . . . . . . . . . . . 27 | 9.2.3. ff_iostats4 . . . . . . . . . . . . . . . . . . . . . 29 | |||
11. Flexible Files Layout Type LAYOUTSTATS . . . . . . . . . . . 27 | 9.3. ff_layoutreturn4 . . . . . . . . . . . . . . . . . . . . 31 | |||
12. Flexible File Layout Type Creation Hint . . . . . . . . . . . 28 | 10. Flexible Files Layout Type LAYOUTERROR . . . . . . . . . . . 31 | |||
12.1. ff_layouthint4 . . . . . . . . . . . . . . . . . . . . . 28 | 11. Flexible Files Layout Type LAYOUTSTATS . . . . . . . . . . . 31 | |||
13. Recalling a Layout . . . . . . . . . . . . . . . . . . . . . 29 | 12. Flexible File Layout Type Creation Hint . . . . . . . . . . . 32 | |||
13.1. CB_RECALL_ANY . . . . . . . . . . . . . . . . . . . . . 29 | 12.1. ff_layouthint4 . . . . . . . . . . . . . . . . . . . . . 32 | |||
14. Client Fencing . . . . . . . . . . . . . . . . . . . . . . . 30 | 13. Recalling a Layout . . . . . . . . . . . . . . . . . . . . . 33 | |||
15. Security Considerations . . . . . . . . . . . . . . . . . . . 30 | 13.1. CB_RECALL_ANY . . . . . . . . . . . . . . . . . . . . . 33 | |||
15.1. Kerberized File Access . . . . . . . . . . . . . . . . . 31 | 14. Client Fencing . . . . . . . . . . . . . . . . . . . . . . . 34 | |||
15.1.1. Loosely Coupled . . . . . . . . . . . . . . . . . . 31 | 15. Security Considerations . . . . . . . . . . . . . . . . . . . 34 | |||
15.1.2. Tightly Coupled . . . . . . . . . . . . . . . . . . 31 | 15.1. Kerberized File Access . . . . . . . . . . . . . . . . . 35 | |||
16. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 32 | 15.1.1. Loosely Coupled . . . . . . . . . . . . . . . . . . 35 | |||
17. References . . . . . . . . . . . . . . . . . . . . . . . . . 32 | 15.1.2. Tightly Coupled . . . . . . . . . . . . . . . . . . 35 | |||
17.1. Normative References . . . . . . . . . . . . . . . . . . 32 | 16. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 36 | |||
17.2. Informative References . . . . . . . . . . . . . . . . . 33 | 17. References . . . . . . . . . . . . . . . . . . . . . . . . . 36 | |||
Appendix A. Acknowledgments . . . . . . . . . . . . . . . . . . 33 | 17.1. Normative References . . . . . . . . . . . . . . . . . . 36 | |||
Appendix B. RFC Editor Notes . . . . . . . . . . . . . . . . . . 33 | 17.2. Informative References . . . . . . . . . . . . . . . . . 37 | |||
Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 33 | Appendix A. Acknowledgments . . . . . . . . . . . . . . . . . . 37 | |||
Appendix B. RFC Editor Notes . . . . . . . . . . . . . . . . . . 37 | ||||
Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 37 | ||||
1. Introduction | 1. Introduction | |||
In the parallel Network File System (pNFS), the metadata server | In the parallel Network File System (pNFS), the metadata server | |||
returns layout type structures that describe where file data is | returns layout type structures that describe where file data is | |||
located. There are different layout types for different storage | located. There are different layout types for different storage | |||
systems and methods of arranging data on storage devices. This | systems and methods of arranging data on storage devices. This | |||
document defines the flexible file layout type used with file-based | document defines the flexible file layout type used with file-based | |||
data servers that are accessed using the Network File System (NFS) | data servers that are accessed using the Network File System (NFS) | |||
protocols: NFSv3 [RFC1813], NFSv4.0 [RFC7530], NFSv4.1 [RFC5661], and | protocols: NFSv3 [RFC1813], NFSv4.0 [RFC7530], NFSv4.1 [RFC5661], and | |||
skipping to change at page 6, line 22 ¶ | skipping to change at page 6, line 25 ¶ | |||
The coupling of the metadata server with the storage devices can be | The coupling of the metadata server with the storage devices can be | |||
either tight or loose. In a tight coupling, there is a control | either tight or loose. In a tight coupling, there is a control | |||
protocol present to manage security, LAYOUTCOMMITs, etc. With a | protocol present to manage security, LAYOUTCOMMITs, etc. With a | |||
loose coupling, the only control protocol might be a version of NFS. | loose coupling, the only control protocol might be a version of NFS. | |||
As such, semantics for managing security, state, and locking models | As such, semantics for managing security, state, and locking models | |||
MUST be defined. | MUST be defined. | |||
2.1. LAYOUTCOMMIT | 2.1. LAYOUTCOMMIT | |||
With a tightly coupled system, when the metadata server receives a | When tightly coupled storage devices are used, the metadata server | |||
LAYOUTCOMMIT (see Section 18.42 of [RFC5661]), the semantics of the | has the responsibility, upon receiving a LAYOUTCOMMIT (see | |||
file layout type MUST be met (see Section 12.5.4 of [RFC5661]). It | Section 18.42 of [RFC5661]), of ensuring that the semantics of pNFS | |||
is the responsibility of the client to make sure the data file is | are respected (see Section 12.5.4 of [RFC5661]). These do not | |||
stable before the metadata server begins to query the storage devices | include a requirement that data written to the storage device be | |||
about the changes to the file. With a loosely coupled system, if any | stable upon completion of the LAYOUTCOMMIT. | |||
WRITE to a storage device did not result with stable_how equal to | ||||
FILE_SYNC, a LAYOUTCOMMIT to the metadata server MUST be preceded | ||||
with a COMMIT to the storage device. Note that if the client has not | ||||
done a COMMIT to the storage device, then the LAYOUTCOMMIT might not | ||||
be synchronized to the last WRITE operation to the storage device. | ||||
2.2. Fencing Clients from the Data Server | In the case of loosely coupled storage devices, it is the | |||
responsibility of the client to make sure the data file is stable | ||||
before the metadata server begins to query the storage devices about | ||||
the changes to the file. If any WRITE to a storage device did not | ||||
result with stable_how equal to FILE_SYNC, a LAYOUTCOMMIT to the | ||||
metadata server MUST be preceded by a COMMIT to the storage devices | ||||
written to. Note that if the client has not done a COMMIT to the | ||||
storage device, then the LAYOUTCOMMIT might not be synchronized to | ||||
the last WRITE operation to the storage device. | ||||
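The loosely coupled rule above can be sketched as follows. This is a minimal illustration, not part of either draft: the function names, the (device, stable_how) tuple shape, and the "mds" label are assumptions made for the example. The stable_how values are those defined for NFS WRITE replies (UNSTABLE, DATA_SYNC, FILE_SYNC).

```python
from enum import IntEnum


class StableHow(IntEnum):
    """stable_how values as returned in an NFS WRITE reply."""
    UNSTABLE = 0
    DATA_SYNC = 1
    FILE_SYNC = 2


def devices_needing_commit(write_results):
    """Return the sorted ids of storage devices that need a COMMIT.

    write_results is a hypothetical iterable of (device_id, StableHow)
    pairs, one per WRITE reply received since the last COMMIT.
    """
    return sorted({dev for dev, how in write_results
                   if how != StableHow.FILE_SYNC})


def plan_layoutcommit(write_results):
    """Order the operations: every required COMMIT, then LAYOUTCOMMIT."""
    ops = [("COMMIT", dev) for dev in devices_needing_commit(write_results)]
    ops.append(("LAYOUTCOMMIT", "mds"))  # "mds" is an illustrative label
    return ops
```

If every WRITE reply reported FILE_SYNC, the plan contains only the LAYOUTCOMMIT. Otherwise each storage device that returned a weaker stable_how gets a COMMIT first.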
2.2. Fencing Clients from the Storage Device | ||||
With loosely coupled storage devices, the metadata server uses | With loosely coupled storage devices, the metadata server uses | |||
synthetic uids and gids for the data file, where the uid owner of the | synthetic uids and gids for the data file, where the uid owner of the | |||
data file is allowed read/write access and the gid owner is allowed | data file is allowed read/write access and the gid owner is allowed | |||
read only access. As part of the layout (see ffds_user and | read only access. As part of the layout (see ffds_user and | |||
ffds_group in Section 5.1), the client is provided with the user and | ffds_group in Section 5.1), the client is provided with the user and | |||
group to be used in the Remote Procedure Call (RPC) [RFC5531] | group to be used in the Remote Procedure Call (RPC) [RFC5531] | |||
credentials needed to access the data file. Fencing off of clients | credentials needed to access the data file. Fencing off of clients | |||
is achieved by the metadata server changing the synthetic uid and/or | is achieved by the metadata server changing the synthetic uid and/or | |||
gid owners of the data file on the storage device to implicitly | gid owners of the data file on the storage device to implicitly | |||
skipping to change at page 7, line 18 ¶ | skipping to change at page 7, line 26 ¶ | |||
all directories holding data files to the root user. This approach | all directories holding data files to the root user. This approach | |||
provides a practical model to enforce access control and fence off | provides a practical model to enforce access control and fence off | |||
cooperative clients, but it cannot protect against malicious | cooperative clients, but it cannot protect against malicious | |||
clients; hence it provides a level of security equivalent to | clients; hence it provides a level of security equivalent to | |||
AUTH_SYS. | AUTH_SYS. | |||
With tightly coupled storage devices, the metadata server sets the | With tightly coupled storage devices, the metadata server sets the | |||
user and group owners, mode bits, and ACL of the data file to be the | user and group owners, mode bits, and ACL of the data file to be the | |||
same as the metadata file. And the client must authenticate with the | same as the metadata file. And the client must authenticate with the | |||
storage device and go through the same authorization process it would | storage device and go through the same authorization process it would | |||
go through via the metadata server. | go through via the metadata server. In the case of tight coupling, | |||
fencing is the responsibility of the control protocol and is not | ||||
described in detail here. However, implementations of the tight | ||||
coupling locking model (see Section 2.3) will need a way to prevent | ||||
access by certain clients to specific files by invalidating the | ||||
corresponding stateids on the storage device. | ||||
2.2.1. Implementation Notes for Synthetic uids/gids | 2.2.1. Implementation Notes for Synthetic uids/gids | |||
The selection method for the synthetic uids and gids to be used for | The selection method for the synthetic uids and gids to be used for | |||
fencing in loosely coupled storage devices is strictly an | fencing in loosely coupled storage devices is strictly an | |||
implementation issue. For example, an administrator might restrict a range | implementation issue. For example, an administrator might restrict a range | |||
of such ids available to the Lightweight Directory Access Protocol | of such ids available to the Lightweight Directory Access Protocol | |||
(LDAP) 'uid' field [RFC4519]. She might also be able to choose an id | (LDAP) 'uid' field [RFC4519]. She might also be able to choose an id | |||
that would never be used to grant access. Then when the metadata | that would never be used to grant access. Then when the metadata | |||
server had a request to access a file, a SETATTR would be sent to the | server had a request to access a file, a SETATTR would be sent to the | |||
skipping to change at page 7, line 44 ¶ | skipping to change at page 8, line 10 ¶ | |||
client. And it would present them as the RPC credentials to the | client. And it would present them as the RPC credentials to the | |||
storage device. When the client was done accessing the file and the | storage device. When the client was done accessing the file and the | |||
metadata server knew that no other client was accessing the file, it | metadata server knew that no other client was accessing the file, it | |||
could reset the owner and group to restrict access to the data file. | could reset the owner and group to restrict access to the data file. | |||
When the metadata server wanted to fence off a client, it would | When the metadata server wanted to fence off a client, it would | |||
change the synthetic uid and/or gid to the restricted ids. Note that | change the synthetic uid and/or gid to the restricted ids. Note that | |||
using a restricted id ensures that there is a change of owner and at | using a restricted id ensures that there is a change of owner and at | |||
least one id available that never gets allowed access. | least one id available that never gets allowed access. | |||
Under an AUTH_SYS security model, synthetic uids and gids of 0 SHOULD | ||||
be avoided. These typically either grant super access to files on a | ||||
storage device or are mapped to an anonymous id. In the first case, | ||||
even if the data file is fenced, the client might still be able to | ||||
access the file. In the second case, multiple ids might be mapped to | ||||
the anonymous ids. | ||||
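One way a metadata server might choose and revoke synthetic ids under these constraints is sketched below. Everything here is an assumption for illustration: the function names, the dict used to model the data file's ownership, and the idea of a single reserved restricted id. A real metadata server would apply the change with a SETATTR to the storage device.

```python
import random


def pick_synthetic_id(id_range, restricted_id, in_use=(), rng=random):
    """Pick a synthetic uid or gid from an administrator-supplied range.

    Skips 0 (often root or remapped to an anonymous id), the reserved
    restricted id that never grants access, and any id already in use.
    """
    candidates = [i for i in id_range
                  if i != 0 and i != restricted_id and i not in in_use]
    if not candidates:
        raise ValueError("no synthetic ids left in the configured range")
    return rng.choice(candidates)


def fence(data_file, restricted_id):
    """Return new ownership for a data file that fences off all clients.

    data_file is a hypothetical dict with "uid" and "gid" keys. Setting
    both to the restricted id guarantees the owner actually changes, so
    credentials handed out in earlier layouts stop working.
    """
    return {**data_file, "uid": restricted_id, "gid": restricted_id}
```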
2.2.2. Example of using Synthetic uids/gids | 2.2.2. Example of using Synthetic uids/gids | |||
The user loghyr creates a file "ompha.c" on the metadata server and | The user loghyr creates a file "ompha.c" on the metadata server and | |||
it creates a corresponding data file on the storage device. | it creates a corresponding data file on the storage device. | |||
The metadata server entry may look like: | The metadata server entry may look like: | |||
-rw-r--r-- 1 loghyr staff 1697 Dec 4 11:31 ompha.c | -rw-r--r-- 1 loghyr staff 1697 Dec 4 11:31 ompha.c | |||
On the storage device, it may be assigned some random synthetic uid/ | On the storage device, it may be assigned some random synthetic uid/ | |||
gid to deny access: | gid to deny access: | |||
-rw-r----- 1 19452 28418 1697 Dec 4 11:31 data_ompha.c | -rw-r----- 1 19452 28418 1697 Dec 4 11:31 data_ompha.c | |||
When the file is opened on a client, since the layout knows nothing | When the file is opened on a client, since the layout knows nothing | |||
about the user (and does not care), whether loghyr or garbo opens the | about the user (and does not care), whether loghyr or garbo opens the | |||
file does not matter. The owner and group are modified and those | file does not matter. The owner and group are modified and those | |||
values are returned. | values are returned. | |||
skipping to change at page 8, line 41 ¶ | skipping to change at page 9, line 14 ¶ | |||
While pushing the enforcement of permission checking onto the client | While pushing the enforcement of permission checking onto the client | |||
may seem to weaken security, the client may already be responsible | may seem to weaken security, the client may already be responsible | |||
for enforcing permissions before modifications are sent to a server. | for enforcing permissions before modifications are sent to a server. | |||
With cached writes, the client is always responsible for tracking who | With cached writes, the client is always responsible for tracking who | |||
is modifying a file and making sure to not coalesce requests from | is modifying a file and making sure to not coalesce requests from | |||
multiple users into one request. | multiple users into one request. | |||
2.3. State and Locking Models | 2.3. State and Locking Models | |||
Metadata file OPEN, LOCK, and DELEGATION operations are always | The choice of locking models is governed by the following rules: | |||
executed only against the metadata server. | ||||
The metadata server responds to state changing operations by | o Storage devices implementing the NFSv3 and NFSv4.0 protocols are | |||
executing them against the respective data files on the storage | always treated as loosely coupled. | |||
devices. It then sends the storage device open stateid as part of | ||||
the layout (see the ffm_stateid in Section 5.1) and it is then used | o NFSv4.1+ storage devices that do not return the | |||
by the client for executing READ/WRITE operations against the storage | EXCHGID4_FLAG_USE_PNFS_DS flag set to EXCHANGE_ID are indicating | |||
that they are to be treated as loosely coupled. From the locking | ||||
viewpoint they are treated in the same way as NFSv4.0 storage | ||||
devices. | ||||
o NFSv4.1+ storage devices that do identify themselves with the | ||||
EXCHGID4_FLAG_USE_PNFS_DS flag set to EXCHANGE_ID are considered | ||||
tightly coupled. They would use a back-end control protocol to | ||||
implement the global stateid model as described in [RFC5661]. | ||||
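The three rules above amount to a small decision function. The sketch below is illustrative only: the function name and its parameters are assumptions, and the eir_flags argument stands for the flags word of the storage device's EXCHANGE_ID reply. The constant value is the one RFC 5661 assigns to EXCHGID4_FLAG_USE_PNFS_DS.

```python
# Flag value from the NFSv4.1 XDR in RFC 5661.
EXCHGID4_FLAG_USE_PNFS_DS = 0x00040000


def coupling_model(version, minorversion, eir_flags=0):
    """Classify a storage device as "loose" or "tight" for locking purposes.

    version and minorversion mirror ffdv_version and ffdv_minorversion;
    eir_flags is only meaningful for NFSv4.1 and later.
    """
    if version == 3 or (version == 4 and minorversion == 0):
        return "loose"  # NFSv3 and NFSv4.0 are always loosely coupled
    if version == 4 and minorversion >= 1:
        if eir_flags & EXCHGID4_FLAG_USE_PNFS_DS:
            return "tight"  # device identified itself as a pNFS data server
        return "loose"
    raise ValueError("unsupported NFS version for this layout type")
```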
2.3.1. Loosely Coupled Locking Model | ||||
When locking-related operations are requested, they are primarily | ||||
dealt with by the metadata server, which generates the appropriate | ||||
stateids. When an NFSv4 version is used as the data access protocol, | ||||
the metadata server may make stateid-related requests of the storage | ||||
devices. However, it is not required to do so and the resulting | ||||
stateids are known only to the metadata server and the storage | ||||
device. | device. | |||
NFSv4.1+ storage devices that do not return the | Given this basic structure, locking-related operations are handled as | |||
EXCHGID4_FLAG_USE_PNFS_DS flag set to EXCHANGE_ID are indicating that | follows: | |||
they are loosely coupled. As such, they are treated the same way as | ||||
NFSv4 storage devices. | ||||
NFSv4.1+ storage devices that do identify themselves with the | o OPENs are dealt with by the metadata server. Stateids are | |||
EXCHGID4_FLAG_USE_PNFS_DS flag set to EXCHANGE_ID are tightly | selected by the metadata server and associated with the client id | |||
coupled. They will be using a back-end control protocol as described | describing the client's connection to the metadata server. The | |||
in [RFC5661] to implement a global stateid model as defined there. | metadata server may need to interact with the storage device to | |||
locate the file to be opened, but no locking-related functionality | ||||
need be used on the storage device. | ||||
OPEN_DOWNGRADE and CLOSE only require local execution on the | ||||
metadata server. | ||||
o Advisory byte-range locks can be implemented locally on the | ||||
metadata server. As in the case of OPENs, the stateids associated | ||||
with byte-range locks are assigned by the metadata server and only | ||||
used on the metadata server. | ||||
o Delegations are assigned by the metadata server which initiates | ||||
recalls when conflicting OPENs are processed. No storage device | ||||
involvement is required. | ||||
o TEST_STATEID and FREE_STATEID are processed locally on the | ||||
metadata server, without storage device involvement. | ||||
All I/O operations to the storage device are done using the anonymous | ||||
stateid. Thus the storage device has no information about the | ||||
openowner and lockowner responsible for issuing a particular I/O | ||||
operation. As a result: | ||||
o Mandatory byte-range locking cannot be supported because the | ||||
storage device has no way of distinguishing I/O done on behalf of | ||||
the lock owner from those done by others. | ||||
o Enforcement of share reservations is the responsibility of the | ||||
client. Even though I/O is done using the anonymous stateid, the | ||||
client must ensure that it has a valid stateid associated with the | ||||
openowner that allows the I/O being done before issuing the I/O. | ||||
In the event that a stateid is revoked, the metadata server is | ||||
responsible for preventing client access, since it has no way of | ||||
being sure that the client is aware that the stateid in question has | ||||
been revoked. | ||||
As the client never receives a stateid generated by a storage device, | ||||
there is no client lease on the storage device and no prospect of | ||||
lease expiration, even when access is via NFSv4 protocols. Clients | ||||
will have leases on the metadata server. In dealing with lease | ||||
expiration, the metadata server may need to use fencing to prevent | ||||
revoked stateids from being relied upon by a client unaware of the | ||||
fact that they have been revoked. | ||||
2.3.2. Tightly Coupled Locking Model | ||||
When locking-related operations are requested, they are primarily | ||||
dealt with by the metadata server, which generates the appropriate | ||||
stateids. These stateids must be made known to the storage device | ||||
using control protocol facilities, the details of which are not | ||||
discussed in this document. | ||||
Given this basic structure, locking-related operations are handled as | ||||
follows: | ||||
o OPENs are dealt with primarily on the metadata server. Stateids | ||||
are selected by the metadata server and associated with the client | ||||
id describing the client's connection to the metadata server. The | ||||
metadata server needs to interact with the storage device to | ||||
locate the file to be opened, and to make the storage device aware | ||||
of the association between the metadata-server-chosen stateid and | ||||
the client and openowner that it represents. | ||||
OPEN_DOWNGRADE and CLOSE are executed initially on the metadata | ||||
server but the state change made must be propagated to the storage | ||||
device. | ||||
o Advisory byte-range locks can be implemented locally on the | ||||
metadata server. As in the case of OPENs, the stateids associated | ||||
with byte-range locks are assigned by the metadata server and are | ||||
available for use on the metadata server. Because I/O operations | ||||
are allowed to present lock stateids, the metadata server needs | ||||
the ability to make the storage device aware of the association | ||||
between the metadata-server-chosen stateid and the corresponding | ||||
open stateid it is associated with. | ||||
o Mandatory byte-range locks can be supported when both the metadata | ||||
server and the storage devices have the appropriate support. As | ||||
in the case of advisory byte-range locks, these are assigned by | ||||
the metadata server and are available for use on the metadata | ||||
server. To enable mandatory lock enforcement on the storage | ||||
device, the metadata server needs the ability to make the storage | ||||
device aware of the association between the metadata-server-chosen | ||||
stateid and the client, openowner, and lock (i.e., lockowner, | ||||
byte-range, lock-type) that it represents. Because I/O operations | ||||
are allowed to present lock stateids, this information needs to be | ||||
propagated to all storage devices to which I/O might be directed | ||||
rather than only to the storage devices that contain the locked | ||||
region. | ||||
o Delegations are assigned by the metadata server which initiates | ||||
recalls when conflicting OPENs are processed. Because I/O | ||||
operations are allowed to present delegation stateids, the | ||||
metadata server requires the ability to make the storage device | ||||
aware of the association between the metadata-server-chosen | ||||
stateid and the filehandle and delegation type it represents, and | ||||
to break such an association. | ||||
o TEST_STATEID is processed locally on the metadata server, without | ||||
storage device involvement. | ||||
o FREE_STATEID is processed on the metadata server but the metadata | ||||
server requires the ability to propagate the request to the | ||||
corresponding storage devices. | ||||
Because the client will possess and use stateids valid on the storage | ||||
device, there will be a client lease on the storage device and the | ||||
possibility of lease expiration does exist. The best approach for | ||||
the storage device is to retain these locks as a courtesy. However, | ||||
if it does not do so, control protocol facilities need to provide the | ||||
means to synchronize lock state between the metadata server and | ||||
storage device. | ||||
Clients will also have leases on the metadata server, which are | ||||
subject to expiration. In dealing with lease expiration, the | ||||
metadata server would be expected to use control protocol facilities | ||||
enabling it to invalidate revoked stateids on the storage device. In | ||||
the event the client is not responsive, the metadata server may need | ||||
to use fencing to prevent revoked stateids from being acted upon by | ||||
the storage device. | ||||
3. XDR Description of the Flexible File Layout Type | 3. XDR Description of the Flexible File Layout Type | |||
This document contains the external data representation (XDR) | This document contains the external data representation (XDR) | |||
[RFC4506] description of the flexible file layout type. The XDR | [RFC4506] description of the flexible file layout type. The XDR | |||
description is embedded in this document in a way that makes it | description is embedded in this document in a way that makes it | |||
simple for the reader to extract into a ready-to-compile form. The | simple for the reader to extract into a ready-to-compile form. The | |||
reader can feed this document into the following shell script to | reader can feed this document into the following shell script to | |||
produce the machine readable XDR description of the flexible file | produce the machine readable XDR description of the flexible file | |||
layout type: | layout type: | |||
skipping to change at page 12, line 20 ¶ | skipping to change at page 15, line 26 ¶ | |||
The ffda_versions array allows the metadata server to present choices | The ffda_versions array allows the metadata server to present choices | |||
as to NFS version, minor version, and coupling strength to the | as to NFS version, minor version, and coupling strength to the | |||
client. The ffdv_version and ffdv_minorversion represent the NFS | client. The ffdv_version and ffdv_minorversion represent the NFS | |||
protocol to be used to access the storage device. This layout | protocol to be used to access the storage device. This layout | |||
specification defines the semantics for ffdv_versions 3 and 4. If | specification defines the semantics for ffdv_versions 3 and 4. If | |||
ffdv_version equals 3 then the server MUST set ffdv_minorversion to 0 | ffdv_version equals 3 then the server MUST set ffdv_minorversion to 0 | |||
and ffdv_tightly_coupled to false. The client MUST then access the | and ffdv_tightly_coupled to false. The client MUST then access the | |||
storage device using the NFSv3 protocol [RFC1813]. If ffdv_version | storage device using the NFSv3 protocol [RFC1813]. If ffdv_version | |||
equals 4 then the server MUST set ffdv_minorversion to one of the | equals 4 then the server MUST set ffdv_minorversion to one of the | |||
NFSv4 minor version numbers and the client MUST access the storage | NFSv4 minor version numbers and the client MUST access the storage | |||
device using NFSv4. | device using NFSv4 with the specified minor version. | |||
Note that while the client might determine that it cannot use any of | Note that while the client might determine that it cannot use any of | |||
the configured combinations of ffdv_version, ffdv_minorversion, and | the configured combinations of ffdv_version, ffdv_minorversion, and | |||
ffdv_tightly_coupled, when it gets the device list from the metadata | ffdv_tightly_coupled, when it gets the device list from the metadata | |||
server, there is no way to indicate to the metadata server as to | server, there is no way to indicate to the metadata server as to | |||
which device it is version incompatible. If however, the client | which device it is version incompatible. If however, the client | |||
waits until it retrieves the layout from the metadata server, it can | waits until it retrieves the layout from the metadata server, it can | |||
at that time clearly identify the storage device in question (see | at that time clearly identify the storage device in question (see | |||
Section 5.3). | Section 5.3). | |||
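
The version rules above can be illustrated with a small sketch. It is not part of the specification: the DeviceVersion class, the KNOWN_V4_MINORS set, and the "first usable entry wins" selection order are assumptions made only for this example.

```python
from dataclasses import dataclass

# Assumption: the NFSv4 minor versions this sketch accepts; adjust as needed.
KNOWN_V4_MINORS = {0, 1, 2}


@dataclass(frozen=True)
class DeviceVersion:
    """Illustrative stand-in for one ffda_versions entry (not the XDR type)."""
    version: int            # ffdv_version
    minorversion: int       # ffdv_minorversion
    tightly_coupled: bool   # ffdv_tightly_coupled


def server_entry_is_valid(entry):
    """Check the server-side MUSTs stated for ffdv_version 3 and 4."""
    if entry.version == 3:
        # NFSv3: minor version 0 and loosely coupled only.
        return entry.minorversion == 0 and not entry.tightly_coupled
    if entry.version == 4:
        return entry.minorversion in KNOWN_V4_MINORS
    # Semantics are defined only for versions 3 and 4.
    return False


def pick_usable(entries, supported):
    """Return the first valid entry whose (version, minorversion) pair the
    client supports, or None if the client cannot use any of them."""
    for entry in entries:
        if server_entry_is_valid(entry) and \
                (entry.version, entry.minorversion) in supported:
            return entry
    return None
```

A None result corresponds to the case described above, in which the client cannot use any configured combination and can only identify the offending storage device once it holds a layout.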
skipping to change at page 14, line 5 ¶ | skipping to change at page 17, line 5 ¶ | |||
will designate the same storage device. When the storage device is | will designate the same storage device. When the storage device is | |||
accessed over NFSv4.1 or a higher minor version, the two storage | accessed over NFSv4.1 or a higher minor version, the two storage | |||
device addresses will support the implementation of client ID or | device addresses will support the implementation of client ID or | |||
session trunking (the latter is RECOMMENDED) as defined in [RFC5661]. | session trunking (the latter is RECOMMENDED) as defined in [RFC5661]. | |||
The two storage device addresses will share the same server owner or | The two storage device addresses will share the same server owner or | |||
major ID of the server owner. It is not always necessary for the two | major ID of the server owner. It is not always necessary for the two | |||
storage device addresses to designate the same storage device with | storage device addresses to designate the same storage device with | |||
trunking being used. For example, the data could be read-only, and | trunking being used. For example, the data could be read-only, and | |||
the data consist of exact replicas. | the data consist of exact replicas. | |||
5. Flexible File Layout type | 5. Flexible File Layout Type | |||
The layout4 type is defined in [RFC5662] as follows: | The layout4 type is defined in [RFC5662] as follows: | |||
<CODE BEGINS> | <CODE BEGINS> | |||
enum layouttype4 { | enum layouttype4 { | |||
LAYOUT4_NFSV4_1_FILES = 1, | LAYOUT4_NFSV4_1_FILES = 1, | |||
LAYOUT4_OSD2_OBJECTS = 2, | LAYOUT4_OSD2_OBJECTS = 2, | |||
LAYOUT4_BLOCK_VOLUME = 3, | LAYOUT4_BLOCK_VOLUME = 3, | |||
LAYOUT4_FLEX_FILES = 4 | LAYOUT4_FLEX_FILES = 4 | |||
skipping to change at page 17, line 12 ¶ | skipping to change at page 20, line 12 ¶ | |||
Section 5.3 for how to handle versioning issues between the client | Section 5.3 for how to handle versioning issues between the client | |||
and storage devices. | and storage devices. | |||
For tight coupling, ffds_stateid provides the stateid to be used by | For tight coupling, ffds_stateid provides the stateid to be used by | |||
the client to access the file. For loose coupling and a NFSv4 | the client to access the file. For loose coupling and a NFSv4 | |||
storage device, the client may use an anonymous stateid to perform I/ | storage device, the client may use an anonymous stateid to perform I/ | |||
O on the storage device as there is no use for the metadata server | O on the storage device as there is no use for the metadata server | |||
stateid (no control protocol). In such a scenario, the server MUST | stateid (no control protocol). In such a scenario, the server MUST | |||
set the ffds_stateid to be the anonymous stateid. | set the ffds_stateid to be the anonymous stateid. | |||
This specification of the ffds_stateid is mostly broken for the | ||||
tightly coupled model. There needs to exist a one to one mapping | ||||
from ffds_stateid to ffds_fh_vers - each open file on the storage | ||||
device might need an open stateid. As there are established loosely | ||||
coupled implementations of this version of the protocol, the only | ||||
viable approaches for a tightly coupled implementation would be to | ||||
either use an anonymous stateid for the ffds_stateid or restrict the | ||||
size of the ffds_fh_vers to be one. Fixing this issue will require a | ||||
new version of the protocol. | ||||
[[AI14: One reviewer points out for loosely coupled, we can use the | ||||
anon stateid and for tightly coupled we can use the "global stateid". | ||||
These make it appear that the bug in the spec was actually a feature. | ||||
The intent here is to own up to the bug and shipping code. Can it be | ||||
said nicer? --TH]] | ||||
For loosely coupled storage devices, ffds_user and ffds_group provide | For loosely coupled storage devices, ffds_user and ffds_group provide | |||
the synthetic user and group to be used in the RPC credentials that | the synthetic user and group to be used in the RPC credentials that | |||
the client presents to the storage device to access the data files. | the client presents to the storage device to access the data files. | |||
For tightly coupled storage devices, the user and group on the | For tightly coupled storage devices, the user and group on the | |||
storage device will be the same as on the metadata server. I.e., if | storage device will be the same as on the metadata server. I.e., if | |||
ffdv_tightly_coupled (see Section 4.1) is set, then the client MUST | ffdv_tightly_coupled (see Section 4.1) is set, then the client MUST | |||
ignore both ffds_user and ffds_group. | ignore both ffds_user and ffds_group. | |||
The allowed values for both ffds_user and ffds_group are specified in | The allowed values for both ffds_user and ffds_group are specified in | |||
Section 5.9 of [RFC5661]. For NFSv3 compatibility, user and group | Section 5.9 of [RFC5661]. For NFSv3 compatibility, user and group | |||
skipping to change at page 17, line 44 ¶ | skipping to change at page 21, line 12 ¶ | |||
higher perceived utility. The way the client can select the best | higher perceived utility. The way the client can select the best | |||
mirror to access is discussed in Section 8.1. | mirror to access is discussed in Section 8.1. | |||
ffl_flags is a bitmap that allows the metadata server to inform the | ffl_flags is a bitmap that allows the metadata server to inform the | |||
client of particular conditions that may result from the more or less | client of particular conditions that may result from the more or less | |||
tight coupling of the storage devices. | tight coupling of the storage devices. | |||
FF_FLAGS_NO_LAYOUTCOMMIT: can be set to indicate that the client is | FF_FLAGS_NO_LAYOUTCOMMIT: can be set to indicate that the client is | |||
not required to send LAYOUTCOMMIT to the metadata server. | not required to send LAYOUTCOMMIT to the metadata server. | |||
FF_FLAGS_NO_IO_THRU_MDS: can be set to indicate that the client | FF_FLAGS_NO_IO_THRU_MDS: can be set to indicate that the client | |||
SHOULD not send IO operations to the metadata server. I.e., even | SHOULD not send I/O operations to the metadata server. I.e., even | |||
if a storage device is partitioned from the client, the client | if the client could determine that there was a network disconnect | |||
SHOULD not try to proxy the IO through the metadata server. | to a storage device, the client SHOULD not try to proxy the I/O | |||
through the metadata server. | ||||
FF_FLAGS_NO_READ_IO: can be set to indicate that the client SHOULD | FF_FLAGS_NO_READ_IO: can be set to indicate that the client SHOULD | |||
not send READ requests with the layouts of iomode | not send READ requests with the layouts of iomode | |||
LAYOUTIOMODE4_RW. Instead, it should request a layout of iomode | LAYOUTIOMODE4_RW. Instead, it should request a layout of iomode | |||
LAYOUTIOMODE4_READ from the metadata server. | LAYOUTIOMODE4_READ from the metadata server. | |||
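
As a rough illustration of how a client might interpret the ffl_flags bitmap, consider the sketch below. The numeric bit values are placeholders chosen for this example (the authoritative values are in the XDR for the flags, which is not shown in this excerpt), and the FfFlags and client_hints names are hypothetical.

```python
from enum import IntFlag


class FfFlags(IntFlag):
    # Assumption: placeholder bit positions, not taken from this excerpt.
    NO_LAYOUTCOMMIT = 0x1
    NO_IO_THRU_MDS = 0x2
    NO_READ_IO = 0x4


def client_hints(ffl_flags):
    """Translate an ffl_flags bitmap into the client behaviors described
    in the flag definitions above."""
    return {
        # Flag clear: the client is still expected to send LAYOUTCOMMIT.
        "layoutcommit_required": not bool(ffl_flags & FfFlags.NO_LAYOUTCOMMIT),
        # Flag set: avoid sending or proxying I/O through the metadata server.
        "avoid_io_through_mds": bool(ffl_flags & FfFlags.NO_IO_THRU_MDS),
        # Flag set: request a LAYOUTIOMODE4_READ layout for READs.
        "use_read_layout_for_reads": bool(ffl_flags & FfFlags.NO_READ_IO),
    }
```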
5.1.1. Error codes from LAYOUTGET | 5.1.1. Error Codes from LAYOUTGET | |||
[RFC5661] provides little guidance as to how the client is to proceed | [RFC5661] provides little guidance as to how the client is to proceed | |||
with a LAYOUTGET which returns an error of either | with a LAYOUTGET which returns an error of either | |||
NFS4ERR_LAYOUTTRYLATER, NFS4ERR_LAYOUTUNAVAILABLE, or NFS4ERR_DELAY. | NFS4ERR_LAYOUTTRYLATER, NFS4ERR_LAYOUTUNAVAILABLE, or NFS4ERR_DELAY. | |||
Within the context of this document: | ||||
NFS4ERR_LAYOUTUNAVAILABLE: there is no layout available and the IO | NFS4ERR_LAYOUTUNAVAILABLE: there is no layout available and the I/O | |||
is to go to the metadata server. Note that it is possible to have | is to go to the metadata server. Note that it is possible to have | |||
had a layout before a recall and not after. | had a layout before a recall and not after. | |||
NFS4ERR_LAYOUTTRYLATER: there is some issue preventing the layout | NFS4ERR_LAYOUTTRYLATER: there is some issue preventing the layout | |||
from being granted. If the client already has an appropriate | from being granted. If the client already has an appropriate | |||
layout, it SHOULD continue with IO to the storage devices. | layout, it should continue with I/O to the storage devices. | |||
NFS4ERR_DELAY: there is some issue preventing the layout from being | NFS4ERR_DELAY: there is some issue preventing the layout from being | |||
granted. If the client already has an appropriate layout, it | granted. If the client already has an appropriate layout, it | |||
SHOULD not continue with IO to the storage devices. | should not continue with I/O to the storage devices. | |||
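
The three cases above can be summarized as a small decision function. This is only a sketch: the IoPath enumeration and next_io_path function are hypothetical names, and the WAIT_AND_RETRY outcomes (for NFS4ERR_LAYOUTTRYLATER without an existing layout, and for NFS4ERR_DELAY) are assumptions, since the text above only states what the client should or should not do when it already holds a layout.

```python
from enum import Enum


class IoPath(Enum):
    STORAGE_DEVICES = "continue I/O to the storage devices"
    METADATA_SERVER = "send I/O to the metadata server"
    WAIT_AND_RETRY = "pause and retry LAYOUTGET"  # assumption, see lead-in


def next_io_path(error, has_layout):
    """Map a LAYOUTGET error to the client's next step."""
    if error == "NFS4ERR_LAYOUTUNAVAILABLE":
        # No layout is available; I/O goes to the metadata server.
        return IoPath.METADATA_SERVER
    if error == "NFS4ERR_LAYOUTTRYLATER":
        # An existing appropriate layout can still be used.
        return IoPath.STORAGE_DEVICES if has_layout else IoPath.WAIT_AND_RETRY
    if error == "NFS4ERR_DELAY":
        # Even with a layout, I/O to the storage devices should not continue.
        return IoPath.WAIT_AND_RETRY
    raise ValueError("unhandled LAYOUTGET error: " + error)
```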
5.1.2. Client Interactions with FF_FLAGS_NO_IO_THRU_MDS | 5.1.2. Client Interactions with FF_FLAGS_NO_IO_THRU_MDS | |||
If the client does not ask for a layout for a file, then the IO will | Even if the metadata server provides the FF_FLAGS_NO_IO_THRU_MDS | |||
go through the metadata server. Thus, even if the metadata server | flag, the client can still perform I/O to the metadata server. The | |||
sets the FF_FLAGS_NO_IO_THRU_MDS flag, it can recall the layout and | flag is at best a hint. The flag is indicating to the client that | |||
either not set the flag on the new layout or not provide a layout. | the metadata server most likely wants to separate the metadata I/O | |||
When a client encounters an error with a storage device, it typically | from the data I/O to increase the performance of the metadata | |||
returns the layout to the metadata server and requests a new layout. | operations. If the metadata server detects that the client is | |||
The client's IO would then proceed according to the status codes as | performing I/O against it despite the use of the | |||
outlined in Section 5.1.1. | FF_FLAGS_NO_IO_THRU_MDS flag, it can recall the layout and either not | |||
set the flag on the new layout or not provide a layout (perhaps the | ||||
intent was for the server to temporarily prevent data I/O to meet | ||||
some goal). The client's I/O would then proceed according to the | ||||
status codes as outlined in Section 5.1.1. | ||||
5.2. Interactions Between Devices and Layouts | 5.2. Interactions Between Devices and Layouts | |||
In [RFC5661], the file layout type is defined such that the | In [RFC5661], the file layout type is defined such that the | |||
relationship between multipathing and filehandles can result in | relationship between multipathing and filehandles can result in | |||
either 0, 1, or N filehandles (see Section 13.3). Some rationales for | either 0, 1, or N filehandles (see Section 13.3). Some rationales for | |||
this are clustered servers which share the same filehandle or | this are clustered servers which share the same filehandle or | |||
allowing for multiple read-only copies of the file on the same | allowing for multiple read-only copies of the file on the same | |||
storage device. In the flexible file layout type, while there is an | storage device. In the flexible file layout type, while there is an | |||
array of filehandles, they are independent of the multipathing being | array of filehandles, they are independent of the multipathing being | |||
skipping to change at page 20, line 24 ¶ | skipping to change at page 23, line 47 ¶ | |||
The metadata server analyzes the error and determines the required | The metadata server analyzes the error and determines the required | |||
recovery operations such as recovering media failures or | recovery operations such as recovering media failures or | |||
reconstructing missing data files. | reconstructing missing data files. | |||
The metadata server SHOULD recall any outstanding layouts to allow it | The metadata server SHOULD recall any outstanding layouts to allow it | |||
exclusive write access to the stripes being recovered and to prevent | exclusive write access to the stripes being recovered and to prevent | |||
other clients from hitting the same error condition. In these cases, | other clients from hitting the same error condition. In these cases, | |||
the server MUST complete recovery before handing out any new layouts | the server MUST complete recovery before handing out any new layouts | |||
to the affected byte ranges. | to the affected byte ranges. | |||
Although it MAY be acceptable for the client to propagate a | Although the client implementation has the option to propagate a | |||
corresponding error to the application that initiated the I/O | corresponding error to the application that initiated the I/O | |||
operation and drop any unwritten data, the client SHOULD attempt to | operation and drop any unwritten data, the client should attempt to | |||
retry the original I/O operation by requesting a new layout using | retry the original I/O operation by either requesting a new layout or | |||
LAYOUTGET and retry the I/O operation(s) using the new layout, or the | sending the I/O via regular NFSv4.1+ READ or WRITE operations to the | |||
client MAY just retry the I/O operation(s) using regular NFS READ or | metadata server. The client SHOULD attempt to retrieve a new layout | |||
WRITE operations via the metadata server. The client SHOULD attempt | and retry the I/O operation using the storage device first and only | |||
to retrieve a new layout and retry the I/O operation using the | if the error persists, retry the I/O operation via the metadata | |||
storage device first and only if the error persists, retry the I/O | server. | |||
operation via the metadata server. | ||||
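
The recommended ordering (new layout and storage device first, metadata server only if the error persists) might be sketched as follows. The three callables are hypothetical hooks supplied by the caller, and treating a failed LAYOUTGET (None) as an immediate fallback to the metadata server is a simplification for this example; Section 5.1.1 describes the per-error behavior.

```python
def retry_failed_io(get_new_layout, io_via_storage_device,
                    io_via_metadata_server):
    """Retry an I/O that failed against a storage device.

    get_new_layout()          -> layout object, or None if LAYOUTGET failed
    io_via_storage_device(l)  -> result; raises OSError if the error persists
    io_via_metadata_server()  -> result of regular READ/WRITE via the MDS
    """
    layout = get_new_layout()
    if layout is not None:
        try:
            # First choice: retry against the storage device with the
            # newly obtained layout.
            return io_via_storage_device(layout)
        except OSError:
            # The error persists; fall through to the metadata server.
            pass
    # Last resort: send the I/O through the metadata server.
    return io_via_metadata_server()
```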
8. Mirroring | 8. Mirroring | |||
The flexible file layout type has a simple model in place for the | The flexible file layout type has a simple model in place for the | |||
mirroring of the file data constrained by a layout segment. There is | mirroring of the file data constrained by a layout segment. There is | |||
no assumption that each copy of the mirror is stored identically on | no assumption that each copy of the mirror is stored identically on | |||
the storage devices, i.e., one device might employ compression or | the storage devices. For example, one device might employ | |||
deduplication on the data. However, the over the wire transfer of | compression or deduplication on the data. However, the over the wire | |||
the file contents MUST appear identical. Note, this is a construct | transfer of the file contents MUST appear identical. Note, this is a | |||
of the selected XDR representation that each mirrored copy of the | constraint of the selected XDR representation in which each mirrored | |||
layout segment has the same striping pattern (see Figure 1). | copy of the layout segment has the same striping pattern (see | |||
Figure 1). | ||||
The metadata server is responsible for determining the number of | The metadata server is responsible for determining the number of | |||
mirrored copies and the location of each mirror. While the client | mirrored copies and the location of each mirror. While the client | |||
may provide a hint to how many copies it wants (see Section 12), the | may provide a hint to how many copies it wants (see Section 12), the | |||
metadata server can ignore that hint and in any event, the client has | metadata server can ignore that hint and in any event, the client has | |||
no means to dictate either the storage device (which also means the | no means to dictate either the storage device (which also means the | |||
coupling and/or protocol levels to access the layout segments) or | coupling and/or protocol levels to access the layout segments) or | |||
the location of said storage device. | the location of said storage device. | |||
The updating of mirrored layout segments is done via client-side | The updating of mirrored layout segments is done via client-side | |||
mirroring. With this approach, the client is responsible for making | mirroring. With this approach, the client is responsible for making | |||
sure modifications get to all copies of the layout segments it is | sure modifications are made on all copies of the layout segments it | |||
informed of via the layout. If a layout segment is being resilvered | is informed of via the layout. If a layout segment is being | |||
to a storage device, that mirrored copy will not be in the layout. | resilvered to a storage device, that mirrored copy will not be in the | |||
Thus the metadata server MUST update that copy until the client is | layout. Thus the metadata server MUST update that copy until the | |||
presented it in a layout. Also, if the client is writing to the | client is presented it in a layout. If the client is writing to the | |||
layout segments via the metadata server, e.g., using an earlier | layout segments via the metadata server, then the metadata server | |||
version of the protocol, then the metadata server MUST update all | MUST update all copies of the mirror. As seen in Section 8.3, during | |||
copies of the mirror. As seen in Section 8.3, during the | the resilvering, the layout is recalled, and the client has to make | |||
resilvering, the layout is recalled, and the client has to make | ||||
modifications via the metadata server. | modifications via the metadata server. | |||
8.1. Selecting a Mirror | 8.1. Selecting a Mirror | |||
When the metadata server grants a layout to a client, it MAY let the | When the metadata server grants a layout to a client, it MAY let the | |||
client know how fast it expects each mirror to be once the request | client know how fast it expects each mirror to be once the request | |||
arrives at the storage devices via the ffds_efficiency member. While | arrives at the storage devices via the ffds_efficiency member. While | |||
the algorithms to calculate that value are left to the metadata | the algorithms to calculate that value are left to the metadata | |||
server implementations, factors that could contribute to that | server implementations, factors that could contribute to that | |||
calculation include speed of the storage device, physical memory | calculation include speed of the storage device, physical memory | |||
skipping to change at page 21, line 44 ¶ | skipping to change at page 25, line 19 ¶ | |||
device because it has no presence on the given subnet. | device because it has no presence on the given subnet. | |||
As such, it is the client which decides which mirror to access for | As such, it is the client which decides which mirror to access for | |||
reading the file. The requirements for writing to mirrored layout | reading the file. The requirements for writing to mirrored layout | |||
segments are presented below. | segments are presented below. | |||
8.2. Writing to Mirrors | 8.2. Writing to Mirrors | |||
The client is responsible for updating all mirrored copies of the | The client is responsible for updating all mirrored copies of the | |||
layout segments that it is given in the layout. A single failed | layout segments that it is given in the layout. A single failed | |||
update is sufficient to fail the entire operation. I.e., if all but | update is sufficient to fail the entire operation. If all but one | |||
one copy is updated successfully and the last one provides an error, | copy is updated successfully and the last one provides an error, then | |||
then the client needs to inform the metadata server about the error | the client needs to inform the metadata server about the error via | |||
via either LAYOUTRETURN or LAYOUTERROR that the update failed to that | either LAYOUTRETURN or LAYOUTERROR that the update failed to that | |||
storage device. If the client is updating the mirrors serially, then | storage device. If the client is updating the mirrors serially, then | |||
it SHOULD stop at the first error encountered and report that to the | it SHOULD stop at the first error encountered and report that to the | |||
metadata server. If the client is updating the mirrors in parallel, | metadata server. If the client is updating the mirrors in parallel, | |||
then it SHOULD wait until all storage devices respond such that it | then it SHOULD wait until all storage devices respond such that it | |||
can report all errors encountered during the update. | can report all errors encountered during the update. | |||
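
The difference between the serial and parallel cases can be sketched as below. The write_one callable is a hypothetical hook that returns None on success or an error value on failure; the returned list stands for what the client would report to the metadata server via LAYOUTRETURN or LAYOUTERROR. The use of a thread pool is merely one way to issue the updates in parallel.

```python
from concurrent.futures import ThreadPoolExecutor


def write_serial(mirrors, write_one):
    """Update mirrors one at a time, stopping at the first error."""
    for mirror in mirrors:
        err = write_one(mirror)
        if err is not None:
            # Stop here; later mirrors are not attempted.
            return [(mirror, err)]
    return []


def write_parallel(mirrors, write_one):
    """Update all mirrors concurrently and wait for every response so
    that all errors can be reported together."""
    with ThreadPoolExecutor(max_workers=max(1, len(mirrors))) as pool:
        results = list(pool.map(write_one, mirrors))
    return [(m, e) for m, e in zip(mirrors, results) if e is not None]
```

In both cases a non-empty result means the update as a whole has failed, even if most mirrors were written successfully.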
The metadata server is then responsible for determining if it wants | The metadata server is then responsible for determining if it wants | |||
to remove the errant mirror from the layout, if the mirror has | to remove the errant mirror from the layout, if the mirror has | |||
recovered from some transient error, etc. When the client tries to | recovered from some transient error, etc. When the client tries to | |||
get a new layout, the metadata server informs it of the decision by | get a new layout, the metadata server informs it of the decision by | |||
the contents of the layout. The client MUST NOT make any assumptions | the contents of the layout. The client MUST NOT make any assumptions | |||
that the contents of the previous layout will match those of the new | that the contents of the previous layout will match those of the new | |||
one. If it has updates that were not committed, it MUST resend those | one. If it has updates that were not committed to all mirrors, then | |||
updates to all mirrors. | it MUST resend those updates to all mirrors. | |||
There is no provision in the protocol for the metadata server to | There is no provision in the protocol for the metadata server to | |||
directly determine that the client has or has not recovered from an | directly determine that the client has or has not recovered from an | |||
error. I.e., assume that the storage device was network partitioned | error. I.e., assume that the storage device was network partitioned | |||
from the client and all of the copies are successfully updated after | from the client and all of the copies are successfully updated after | |||
the error was reported. There is no mechanism for the client to | the error was reported. There is no mechanism for the client to | |||
report that fact and the metadata server is forced to repair the file | report that fact and the metadata server is forced to repair the file | |||
across the mirror. | across the mirror. | |||
If the client supports NFSv4.2, it can use LAYOUTERROR and | If the client supports NFSv4.2, it can use LAYOUTERROR and | |||
skipping to change at page 22, line 43 ¶ | skipping to change at page 26, line 18 ¶ | |||
The metadata server may elect to create a new mirror of the layout | The metadata server may elect to create a new mirror of the layout | |||
segments at any time. This might be to resilver a copy on a storage | segments at any time. This might be to resilver a copy on a storage | |||
device which was down for servicing, to provide a copy of the layout | device which was down for servicing, to provide a copy of the layout | |||
segments on storage with different storage performance | segments on storage with different storage performance | |||
characteristics, etc. As the client will not be aware of the new | characteristics, etc. As the client will not be aware of the new | |||
mirror and the metadata server will not be aware of updates that the | mirror and the metadata server will not be aware of updates that the | |||
client is making to the layout segments, the metadata server MUST | client is making to the layout segments, the metadata server MUST | |||
recall the writable layout segment(s) that it is resilvering. If the | recall the writable layout segment(s) that it is resilvering. If the | |||
client issues a LAYOUTGET for a writable layout segment which is in | client issues a LAYOUTGET for a writable layout segment which is in | |||
the process of being resilvered, then the metadata server MUST deny | the process of being resilvered, then the metadata server can deny | |||
that request with a NFS4ERR_LAYOUTTRYLATER. The client can then | that request with a NFS4ERR_LAYOUTUNAVAILABLE. The client would then | |||
perform the I/O through the metadata server. | have to perform the I/O through the metadata server. | |||
9. Flexible Files Layout Type Return | 9. Flexible Files Layout Type Return | |||
layoutreturn_file4 is used in the LAYOUTRETURN operation to convey | layoutreturn_file4 is used in the LAYOUTRETURN operation to convey | |||
layout-type specific information to the server. It is defined in | layout-type specific information to the server. It is defined in | |||
[RFC5661] as follows: | Section 18.44.1 of [RFC5661] as follows: | |||
<CODE BEGINS> | <CODE BEGINS> | |||
/* Constants used for LAYOUTRETURN and CB_LAYOUTRECALL */ | ||||
const LAYOUT4_RET_REC_FILE = 1; | ||||
const LAYOUT4_RET_REC_FSID = 2; | ||||
const LAYOUT4_RET_REC_ALL = 3; | ||||
enum layoutreturn_type4 { | ||||
LAYOUTRETURN4_FILE = LAYOUT4_RET_REC_FILE, | ||||
LAYOUTRETURN4_FSID = LAYOUT4_RET_REC_FSID, | ||||
LAYOUTRETURN4_ALL = LAYOUT4_RET_REC_ALL | ||||
}; | ||||
struct layoutreturn_file4 { | struct layoutreturn_file4 { | |||
offset4 lrf_offset; | offset4 lrf_offset; | |||
length4 lrf_length; | length4 lrf_length; | |||
stateid4 lrf_stateid; | stateid4 lrf_stateid; | |||
/* layouttype4 specific data */ | /* layouttype4 specific data */ | |||
opaque lrf_body<>; | opaque lrf_body<>; | |||
}; | }; | |||
union layoutreturn4 switch(layoutreturn_type4 lr_returntype) { | union layoutreturn4 switch(layoutreturn_type4 lr_returntype) { | |||
case LAYOUTRETURN4_FILE: | case LAYOUTRETURN4_FILE: | |||
layoutreturn_file4 lr_layout; | layoutreturn_file4 lr_layout; | |||
default: | default: | |||
void; | void; | |||
}; | }; | |||
struct LAYOUTRETURN4args { | struct LAYOUTRETURN4args { | |||
/* CURRENT_FH: file */ | /* CURRENT_FH: file */ | |||
bool lora_reclaim; | bool lora_reclaim; | |||
skipping to change at page 23, line 33 ¶ | skipping to change at page 27, line 22 ¶ | |||
/* CURRENT_FH: file */ | /* CURRENT_FH: file */ | |||
bool lora_reclaim; | bool lora_reclaim; | |||
layoutreturn_stateid lora_recallstateid; | layoutreturn_stateid lora_recallstateid; | |||
layouttype4 lora_layout_type; | layouttype4 lora_layout_type; | |||
layoutiomode4 lora_iomode; | layoutiomode4 lora_iomode; | |||
layoutreturn4 lora_layoutreturn; | layoutreturn4 lora_layoutreturn; | |||
}; | }; | |||
<CODE ENDS> | <CODE ENDS> | |||
If the lora_layout_type layout type is LAYOUT4_FLEX_FILES, then the | If the lora_layout_type layout type is LAYOUT4_FLEX_FILES and the | |||
lrf_body opaque value is defined by ff_layoutreturn4 (See | lr_returntype is LAYOUTRETURN4_FILE, then the lrf_body opaque value | |||
Section 9.3). It allows the client to report I/O error information | is defined by ff_layoutreturn4 (See Section 9.3). It allows the | |||
or layout usage statistics back to the metadata server as defined | client to report I/O error information or layout usage statistics | |||
below. | back to the metadata server as defined below. Note that while the | |||
data structures are built on concepts introduced in NFSv4.2, the | ||||
effective discriminated union (lora_layout_type combined with | ||||
ff_layoutreturn4) allows for a NFSv4.1 metadata server to utilize the | ||||
data. | ||||
9.1. I/O Error Reporting | 9.1. I/O Error Reporting | |||
9.1.1. ff_ioerr4 | 9.1.1. ff_ioerr4 | |||
<CODE BEGINS> | <CODE BEGINS> | |||
/// struct ff_ioerr4 { | /// struct ff_ioerr4 { | |||
/// offset4 ffie_offset; | /// offset4 ffie_offset; | |||
/// length4 ffie_length; | /// length4 ffie_length; | |||
skipping to change at page 26, line 33 ¶ | skipping to change at page 30, line 28 ¶ | |||
<CODE BEGINS> | <CODE BEGINS> | |||
struct io_info4 { | struct io_info4 { | |||
uint64_t ii_count; | uint64_t ii_count; | |||
uint64_t ii_bytes; | uint64_t ii_bytes; | |||
}; | }; | |||
<CODE ENDS> | <CODE ENDS> | |||
With pNFS, the data transfers are performed directly between the pNFS | With pNFS, data transfers are performed directly between the pNFS | |||
client and the storage devices. Therefore, the metadata server has | client and the storage devices. Therefore, the metadata server has | |||
no visibility to the I/O stream and cannot use any statistical | no direct knowledge of the I/O operations being done and thus cannot | |||
information about client I/O to optimize data storage location. | create on its own statistical information about client I/O to | |||
ff_iostats4 MAY be used by the client to report I/O statistics back | optimize data storage location. ff_iostats4 MAY be used by the | |||
to the metadata server upon returning the layout. Since it is | client to report I/O statistics back to the metadata server upon | |||
infeasible for the client to report every I/O that used the layout, | returning the layout. | |||
the client MAY identify "hot" byte ranges for which to report I/O | ||||
statistics. The definition and/or configuration mechanism of what is | Since it is not feasible for the client to report every I/O that used | |||
considered "hot" and the size of the reported byte range is out of | the layout, the client MAY identify "hot" byte ranges for which to | |||
the scope of this document. It is suggested for client | report I/O statistics. The definition and/or configuration mechanism | |||
of what is considered "hot" and the size of the reported byte range | ||||
is out of the scope of this document. It is suggested for client | ||||
implementation to provide reasonable default values and an optional | implementation to provide reasonable default values and an optional | |||
run-time management interface to control these parameters. For | run-time management interface to control these parameters. For | |||
example, a client can define the default byte range resolution to be | example, a client can define the default byte range resolution to be | |||
1 MB in size and the thresholds for reporting to be 1 MB/second or 10 | 1 MB in size and the thresholds for reporting to be 1 MB/second or 10 | |||
I/O operations per second. For each byte range, ffis_offset and | I/O operations per second. | |||
ffis_length represent the starting offset of the range and the range | ||||
length in bytes. ffis_read.ii_count, ffis_read.ii_bytes, | ||||
ffis_write.ii_count, and ffis_write.ii_bytes represent, respectively, | ||||
the number of contiguous read and write I/Os and the respective | ||||
aggregate number of bytes transferred within the reported byte range. | ||||
The combination of ffis_deviceid and ffl_addr uniquely identify both | For each byte range, ffis_offset and ffis_length represent the | |||
the storage path and the network route to it. Finally, the | starting offset of the range and the range length in bytes. | |||
ffis_read.ii_count, ffis_read.ii_bytes, ffis_write.ii_count, and | ||||
ffis_write.ii_bytes represent, respectively, the number of contiguous | ||||
read and write I/Os and the respective aggregate number of bytes | ||||
transferred within the reported byte range. | ||||
The combination of ffis_deviceid and ffl_addr uniquely identifies | ||||
both the storage path and the network route to it. Finally, the | ||||
ffl_fhandle allows the metadata server to differentiate between | ffl_fhandle allows the metadata server to differentiate between | |||
multiple read-only copies of the file on the same storage device. | multiple read-only copies of the file on the same storage device. | |||
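
The example defaults mentioned above (a 1 MB byte range resolution and reporting thresholds of 1 MB/second or 10 I/O operations per second) can be illustrated with a small sketch of how a client might select "hot" ranges for reads. This is only a sketch under stated assumptions: the function name hot_ranges, the bucketing of each I/O by the range holding its first byte, the fixed observation interval, and the "meets or exceeds" comparison are all hypothetical and are not defined by this document. Only the field names (ffis_offset, ffis_length, ffis_read, ii_count, ii_bytes) come from the text.

```python
from collections import defaultdict

# Example defaults taken from the text; a real client would make these tunable.
RANGE_SIZE = 1 << 20      # 1 MB byte range resolution
BYTES_PER_SEC = 1 << 20   # report at 1 MB/second ...
OPS_PER_SEC = 10          # ... or at 10 I/O operations per second


def hot_ranges(ios, interval_secs):
    """Return read statistics for byte ranges that exceed either threshold.

    ios is an iterable of (offset, length) tuples for the reads observed
    during an interval of interval_secs seconds (hypothetical input shape).
    """
    stats = defaultdict(lambda: {"ii_count": 0, "ii_bytes": 0})
    for offset, length in ios:
        # Assumption: an I/O is attributed to the range holding its first byte.
        bucket = offset // RANGE_SIZE
        stats[bucket]["ii_count"] += 1
        stats[bucket]["ii_bytes"] += length

    report = []
    for bucket in sorted(stats):
        s = stats[bucket]
        if (s["ii_bytes"] / interval_secs >= BYTES_PER_SEC
                or s["ii_count"] / interval_secs >= OPS_PER_SEC):
            report.append({
                "ffis_offset": bucket * RANGE_SIZE,
                "ffis_length": RANGE_SIZE,
                "ffis_read": dict(s),
            })
    return report
```

Ranges that stay below both thresholds are simply omitted, which reflects the point that reporting every I/O is not feasible. The same bookkeeping could be kept separately for writes to fill in ffis_write.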
9.3. ff_layoutreturn4 | 9.3. ff_layoutreturn4 | |||
<CODE BEGINS> | <CODE BEGINS> | |||
/// struct ff_layoutreturn4 { | /// struct ff_layoutreturn4 { | |||
/// ff_ioerr4 fflr_ioerr_report<>; | /// ff_ioerr4 fflr_ioerr_report<>; | |||
/// ff_iostats4 fflr_iostats_report<>; | /// ff_iostats4 fflr_iostats_report<>; | |||
skipping to change at page 29, line 18 ¶ | skipping to change at page 33, line 18 ¶ | |||
reasons for recalling a layout, the flexible file layout type | reasons for recalling a layout, the flexible file layout type | |||
metadata server should recall outstanding layouts in the following | metadata server should recall outstanding layouts in the following | |||
cases: | cases: | |||
o When the file's security policy changes, i.e., Access Control | o When the file's security policy changes, i.e., Access Control | |||
Lists (ACLs) or permission mode bits are set. | Lists (ACLs) or permission mode bits are set. | |||
o When the file's layout changes, rendering outstanding layouts | o When the file's layout changes, rendering outstanding layouts | |||
invalid. | invalid. | |||
o When there are sharing conflicts. | o When existing layouts are inconsistent with the need to enforce | |||
locking constraints. | ||||
o When a file is being resilvered, either due to being repaired | o When existing layouts are inconsistent with the requirements | |||
after a write error or to load balance. | regarding resilvering as described in Section 8.3. | |||
13.1. CB_RECALL_ANY | 13.1. CB_RECALL_ANY | |||
The metadata server can use the CB_RECALL_ANY callback operation to | The metadata server can use the CB_RECALL_ANY callback operation to | |||
notify the client to return some or all of its layouts. The | notify the client to return some or all of its layouts. [RFC5661] | |||
[RFC5661] defines the following types: | defines the allowed types, but makes no provision to expand them. It | |||
does hint that "storage protocols" can expand the range, but does not | ||||
define such a process. If we put the values under IANA control, then | ||||
we could define the following types: | ||||
<CODE BEGINS> | <CODE BEGINS> | |||
const RCA4_TYPE_MASK_FF_LAYOUT_MIN = -2; | const RCA4_TYPE_MASK_FF_LAYOUT_MIN = -2; | |||
const RCA4_TYPE_MASK_FF_LAYOUT_MAX = -1; | const RCA4_TYPE_MASK_FF_LAYOUT_MAX = -1; | |||
[[RFC Editor: please insert assigned constants]] | [[RFC Editor: please insert assigned constants]] | |||
struct CB_RECALL_ANY4args { | struct CB_RECALL_ANY4args { | |||
uint32_t craa_layouts_to_keep; | uint32_t craa_layouts_to_keep; | |||
bitmap4 craa_type_mask; | bitmap4 craa_type_mask; | |||
}; | }; | |||
<CODE ENDS> | <CODE ENDS> | |||
[[AI13: No, 5661 does not define these above values. The ask here is | ||||
to create these and _add_ them to 5661. --TH]] | ||||
Typically, CB_RECALL_ANY will be used to recall client state when the | Typically, CB_RECALL_ANY will be used to recall client state when the | |||
server needs to reclaim resources. The craa_type_mask bitmap | server needs to reclaim resources. The craa_type_mask bitmap | |||
specifies the type of resources that are recalled and the | specifies the type of resources that are recalled and the | |||
craa_layouts_to_keep value specifies how many of the recalled | craa_layouts_to_keep value specifies how many of the recalled | |||
flexible file layouts the client is allowed to keep. The flexible | flexible file layouts the client is allowed to keep. The flexible | |||
file layout type mask flags are defined as follows: | file layout type mask flags are defined as follows: | |||
<CODE BEGINS> | <CODE BEGINS> | |||
/// enum ff_cb_recall_any_mask { | /// enum ff_cb_recall_any_mask { | |||
/// FF_RCA4_TYPE_MASK_READ = -2, | /// FF_RCA4_TYPE_MASK_READ = -2, | |||
/// FF_RCA4_TYPE_MASK_RW = -1 | /// FF_RCA4_TYPE_MASK_RW = -1 | |||
[[RFC Editor: please insert assigned constants]] | [[RFC Editor: please insert assigned constants]] | |||
/// }; | /// }; | |||
/// | /// | |||
<CODE ENDS> | <CODE ENDS> | |||
They represent the iomode of the recalled layouts. In response, the | They represent the iomode of the recalled layouts. In response, the | |||
skipping to change at page 30, line 47 ¶ | skipping to change at page 34, line 50 ¶ | |||
The control path contains all the new operations described by this | The control path contains all the new operations described by this | |||
extension; all existing NFSv4 security mechanisms and features apply | extension; all existing NFSv4 security mechanisms and features apply | |||
to the control path. The combination of components in a pNFS system | to the control path. The combination of components in a pNFS system | |||
is required to preserve the security properties of NFSv4.1+ with | is required to preserve the security properties of NFSv4.1+ with | |||
respect to an entity accessing data via a client, including security | respect to an entity accessing data via a client, including security | |||
countermeasures to defend against threats that NFSv4.1+ provides | countermeasures to defend against threats that NFSv4.1+ provides | |||
defenses for in environments where these threats are considered | defenses for in environments where these threats are considered | |||
significant. | significant. | |||
The metadata server enforces the file access-control policy at | The metadata server enforces the file access-control policy at | |||
LAYOUTGET time. The client should use suitable authorization | LAYOUTGET time. The client should use RPC authorization credentials | |||
credentials for getting the layout for the requested iomode (READ or | (uid/gid for AUTH_SYS or tickets for Kerberos) for getting the layout | |||
RW) and the server verifies the permissions and ACL for these | for the requested iomode (READ or RW) and the server verifies the | |||
credentials, possibly returning NFS4ERR_ACCESS if the client is not | permissions and ACL for these credentials, possibly returning | |||
allowed the requested iomode. If the LAYOUTGET operation succeeds | NFS4ERR_ACCESS if the client is not allowed the requested iomode. If | |||
the client receives, as part of the layout, a set of credentials | the LAYOUTGET operation succeeds the client receives, as part of the | |||
allowing it I/O access to the specified data files corresponding to | layout, a set of credentials allowing it I/O access to the specified | |||
the requested iomode. When the client acts on I/O operations on | data files corresponding to the requested iomode. When the client | |||
behalf of its local users, it MUST authenticate and authorize the | acts on I/O operations on behalf of its local users, it MUST | |||
user by issuing respective OPEN and ACCESS calls to the metadata | authenticate and authorize the user by issuing respective OPEN and | |||
server, similar to having NFSv4 data delegations. If access is | ACCESS calls to the metadata server, similar to having NFSv4 data | |||
allowed, the client uses the corresponding (READ or RW) credentials | delegations. | |||
to perform the I/O operations at the data file's storage devices. | ||||
When the metadata server receives a request to change a file's | If access is allowed, the client uses the corresponding (READ or RW) | |||
permissions or ACL, it SHOULD recall all layouts for that file and it | credentials to perform the I/O operations at the data file's storage | |||
MUST fence off the clients holding outstanding layouts for the | devices. When the metadata server receives a request to change a | |||
respective file by implicitly invalidating the outstanding | file's permissions or ACL, it SHOULD recall all layouts for that file | |||
credentials on all data files comprising before committing to the new | and then MUST fence off any clients still holding outstanding layouts | |||
permissions and ACL. Doing this will ensure that clients re- | for the respective files by implicitly invalidating the previously | |||
authorize their layouts according to the modified permissions and ACL | distributed credential on all data files comprising the file in | |||
by requesting new layouts. Recalling the layouts in this case is | question. It is REQUIRED that this be done before committing to the | |||
courtesy of the server intended to prevent clients from getting an | new permissions and/or ACL. By requesting new layouts, the clients | |||
error on I/Os done after the client was fenced off. | will reauthorize access against the modified access control metadata. | |||
Recalling the layouts in this case is intended to prevent clients | ||||
from getting an error on I/Os done after the client was fenced off. | ||||
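
The ordering described above for a permission or ACL change (recall the layouts, fence the clients, and only then commit the new access control metadata) can be sketched as follows. This is an illustrative sketch only: the class MetadataServerSketch and its methods are hypothetical names, and the bodies merely record the order of the steps rather than performing any real recall or fencing.

```python
class MetadataServerSketch:
    """Hypothetical model that records the order of steps for an ACL change."""

    def __init__(self):
        self.log = []  # (step, file) tuples in the order they were performed

    def recall_layouts(self, file_id):
        # Stand-in for recalling the outstanding layouts on the file.
        self.log.append(("recall", file_id))

    def fence(self, file_id):
        # Stand-in for invalidating the previously distributed credential
        # on every data file comprising the file.
        self.log.append(("fence", file_id))

    def commit_acl(self, file_id, acl):
        # Stand-in for making the new permissions and/or ACL visible.
        self.log.append(("commit", file_id))

    def change_acl(self, file_id, acl, recall=True):
        if recall:
            # SHOULD: recalling first spares clients an error on later I/O.
            self.recall_layouts(file_id)
        # MUST: fencing happens before the new permissions are committed.
        self.fence(file_id)
        self.commit_acl(file_id, acl)
```

The recall step is optional in the sketch because the text makes it a SHOULD, while fencing always precedes the commit because the text requires that ordering.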
15.1. Kerberized File Access | 15.1. Kerberized File Access | |||
15.1.1. Loosely Coupled | 15.1.1. Loosely Coupled | |||
RPCSEC_GSS version 3 (RPCSEC_GSSv3) [rpcsec_gssv3] could be used to | ||||
authorize the client to the storage device on behalf of the metadata | ||||
server. This would require that each of the metadata server, storage | ||||
device, and client would have to implement RPCSEC_GSSv3. The second | ||||
requirement does not match the intent of the loosely coupled model | ||||
that the storage device need not be modified. | ||||
Under this coupling model, the principal used to authenticate the | Under this coupling model, the principal used to authenticate the | |||
metadata file is different than that used to authenticate the data | metadata file is different than that used to authenticate the data | |||
file. I.e., the synthetic principals generated to control access to | file. For the metadata server, the user credentials would be | |||
the data file could prove to be difficult to manage. | generated by the same Kerberos server as the client. However, for | |||
the data storage access, the metadata server would generate the | ||||
While RPCSEC_GSS version 3 (RPCSEC_GSSv3) [rpcsec_gssv3] could be | ticket granting tickets and provide them to the client. Fencing | |||
used to authorize the client to the storage device on behalf of the | would then be controlled either by expiring the ticket or by | |||
metadata server, such a requirement exceeds the loose coupling model. | modifying the synthetic uid or gid on the data file. | |||
I.e., each of the metadata server, storage device, and client would | ||||
have to implement RPCSEC_GSSv3. | ||||
In all, while either an elaborate schema could be used to | ||||
automatically authenticate principals or RPCSEC_GSSv3 aware clients, | ||||
metadata server, and storage devices could be deployed, if more | ||||
secure authentication is desired, tight coupling should be considered | ||||
as described in the next section. | ||||
15.1.2. Tightly Coupled | 15.1.2. Tightly Coupled | |||
With tight coupling, the principal used to access the metadata file | With tight coupling, the principal used to access the metadata file | |||
is exactly the same as used to access the data file. Thus there are | is exactly the same as used to access the data file. As a result | |||
no security issues related to using Kerberos with a tightly coupled | there are no security issues related to using Kerberos with a tightly | |||
system. | coupled system. | |||
16. IANA Considerations | 16. IANA Considerations | |||
As described in [RFC5661], new layout type numbers have been assigned | As described in [RFC5661], new layout type numbers have been assigned | |||
by IANA. This document defines the protocol associated with the | by IANA. This document defines the protocol associated with the | |||
existing layout type number, LAYOUT4_FLEX_FILES. | existing layout type number, LAYOUT4_FLEX_FILES. | |||
17. References | 17. References | |||
17.1. Normative References | 17.1. Normative References | |||
skipping to change at page 33, line 29 ¶ | skipping to change at page 37, line 29 ¶ | |||
document include: Matt W. Benjamin, Adam Emerson, J. Bruce Fields, | document include: Matt W. Benjamin, Adam Emerson, J. Bruce Fields, | |||
and Lev Solomonov. | and Lev Solomonov. | |||
Those who provided miscellaneous comments to the final drafts of this | Those who provided miscellaneous comments to the final drafts of this | |||
document include: Anand Ganesh, Robert Wipfel, Gobikrishnan | document include: Anand Ganesh, Robert Wipfel, Gobikrishnan | |||
Sundharraj, and Trond Myklebust. | Sundharraj, and Trond Myklebust. | |||
Idan Kedar caught a nasty bug in the interaction of client side | Idan Kedar caught a nasty bug in the interaction of client side | |||
mirroring and the minor versioning of devices. | mirroring and the minor versioning of devices. | |||
Dave Noveck provided a comprehensive review of the document during | Dave Noveck provided comprehensive reviews of the document during the | |||
the working group last call. | working group last calls. | |||
Olga Kornievskaia led the charge against the use of a credential | Olga Kornievskaia made a convincing case against the use of a | |||
versus a principal in the fencing approach. Andy Adamson and | credential versus a principal in the fencing approach. Andy Adamson | |||
Benjamin Kaduk helped to sharpen the focus. | and Benjamin Kaduk helped to sharpen the focus. | |||
Tigran Mkrtchyan provided the use case for not allowing the client to | Tigran Mkrtchyan provided the use case for not allowing the client to | |||
proxy the IO through the data server. | proxy the I/O through the data server. | |||
Appendix B. RFC Editor Notes | Appendix B. RFC Editor Notes | |||
[RFC Editor: please remove this section prior to publishing this | [RFC Editor: please remove this section prior to publishing this | |||
document as an RFC] | document as an RFC] | |||
[RFC Editor: prior to publishing this document as an RFC, please | [RFC Editor: prior to publishing this document as an RFC, please | |||
replace all occurrences of RFCTBD10 with RFCxxxx where xxxx is the | replace all occurrences of RFCTBD10 with RFCxxxx where xxxx is the | |||
RFC number of this document] | RFC number of this document] | |||
End of changes. 52 change blocks. | ||||
207 lines changed or deleted | 404 lines changed or added | |||
This html diff was produced by rfcdiff 1.45. The latest version is available from http://tools.ietf.org/tools/rfcdiff/ |