--- 1/draft-ietf-nfsv4-flex-files-15.txt 2018-01-25 16:13:22.834631235 -0800 +++ 2/draft-ietf-nfsv4-flex-files-16.txt 2018-01-25 16:13:22.918633225 -0800 @@ -1,19 +1,19 @@ NFSv4 B. Halevy Internet-Draft Intended status: Standards Track T. Haynes -Expires: May 24, 2018 Primary Data - November 20, 2017 +Expires: July 29, 2018 Primary Data + January 25, 2018 Parallel NFS (pNFS) Flexible File Layout - draft-ietf-nfsv4-flex-files-15.txt + draft-ietf-nfsv4-flex-files-16.txt Abstract The Parallel Network File System (pNFS) allows a separation between the metadata (onto a metadata server) and data (onto a storage device) for a file. The flexible file layout type is defined in this document as an extension to pNFS which allows the use of storage devices in a fashion such that they require only a quite limited degree of interaction with the metadata server, using already existing protocols. Client-side mirroring is also added to provide @@ -27,119 +27,135 @@ Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet- Drafts is at http://datatracker.ietf.org/drafts/current/. Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress." - This Internet-Draft will expire on May 24, 2018. + This Internet-Draft will expire on July 29, 2018. Copyright Notice - Copyright (c) 2017 IETF Trust and the persons identified as the + Copyright (c) 2018 IETF Trust and the persons identified as the document authors. All rights reserved. This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. 
Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License. Table of Contents 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 3 - 1.1. Definitions . . . . . . . . . . . . . . . . . . . . . . . 3 + 1.1. Definitions . . . . . . . . . . . . . . . . . . . . . . . 4 1.2. Requirements Language . . . . . . . . . . . . . . . . . . 6 2. Coupling of Storage Devices . . . . . . . . . . . . . . . . . 6 - 2.1. LAYOUTCOMMIT . . . . . . . . . . . . . . . . . . . . . . 6 - 2.2. Fencing Clients from the Storage Device . . . . . . . . . 6 + 2.1. LAYOUTCOMMIT . . . . . . . . . . . . . . . . . . . . . . 7 + 2.2. Fencing Clients from the Storage Device . . . . . . . . . 7 2.2.1. Implementation Notes for Synthetic uids/gids . . . . 8 - 2.2.2. Example of using Synthetic uids/gids . . . . . . . . 8 - 2.3. State and Locking Models . . . . . . . . . . . . . . . . 9 + 2.2.2. Example of using Synthetic uids/gids . . . . . . . . 9 + 2.3. State and Locking Models . . . . . . . . . . . . . . . . 10 2.3.1. Loosely Coupled Locking Model . . . . . . . . . . . . 10 - 2.3.2. Tightly Coupled Locking Model . . . . . . . . . . . . 11 + 2.3.2. Tightly Coupled Locking Model . . . . . . . . . . . . 12 3. XDR Description of the Flexible File Layout Type . . . . . . 13 - 3.1. Code Components Licensing Notice . . . . . . . . . . . . 13 + 3.1. Code Components Licensing Notice . . . . . . . . . . . . 14 4. Device Addressing and Discovery . . . . . . . . . . . . . . . 15 4.1. ff_device_addr4 . . . . . . . . . . . . . . . . . . . . . 15 - 4.2. Storage Device Multipathing . . . . . . . . . . . . . . . 16 - 5. Flexible File Layout Type . . . . . . . . . . . . . . . . . . 17 - 5.1. ff_layout4 . . . . . . . . . . . . . . . . . 
. . . . . . 18 + 4.2. Storage Device Multipathing . . . . . . . . . . . . . . . 17 + 5. Flexible File Layout Type . . . . . . . . . . . . . . . . . . 18 + 5.1. ff_layout4 . . . . . . . . . . . . . . . . . . . . . . . 19 5.1.1. Error Codes from LAYOUTGET . . . . . . . . . . . . . 22 - 5.1.2. Client Interactions with FF_FLAGS_NO_IO_THRU_MDS . . 22 - 5.2. LAYOUTCOMMIT . . . . . . . . . . . . . . . . . . . . . . 22 + 5.1.2. Client Interactions with FF_FLAGS_NO_IO_THRU_MDS . . 23 + 5.2. LAYOUTCOMMIT . . . . . . . . . . . . . . . . . . . . . . 23 5.3. Interactions Between Devices and Layouts . . . . . . . . 23 5.4. Handling Version Errors . . . . . . . . . . . . . . . . . 23 - 6. Striping via Sparse Mapping . . . . . . . . . . . . . . . . . 23 + 6. Striping via Sparse Mapping . . . . . . . . . . . . . . . . . 24 7. Recovering from Client I/O Errors . . . . . . . . . . . . . . 24 8. Mirroring . . . . . . . . . . . . . . . . . . . . . . . . . . 25 - 8.1. Selecting a Mirror . . . . . . . . . . . . . . . . . . . 25 + 8.1. Selecting a Mirror . . . . . . . . . . . . . . . . . . . 26 8.2. Writing to Mirrors . . . . . . . . . . . . . . . . . . . 26 8.2.1. Single Storage Device Updates Mirrors . . . . . . . . 26 - 8.2.2. Single Storage Device Updates Mirrors . . . . . . . . 26 - 8.2.3. Handling Write Errors . . . . . . . . . . . . . . . . 26 + 8.2.2. Client Updates All Mirrors . . . . . . . . . . . . . 26 + 8.2.3. Handling Write Errors . . . . . . . . . . . . . . . . 27 8.2.4. Handling Write COMMITs . . . . . . . . . . . . . . . 27 8.3. Metadata Server Resilvering of the File . . . . . . . . . 28 9. Flexible Files Layout Type Return . . . . . . . . . . . . . . 28 9.1. I/O Error Reporting . . . . . . . . . . . . . . . . . . . 29 9.1.1. ff_ioerr4 . . . . . . . . . . . . . . . . . . . . . . 29 9.2. Layout Usage Statistics . . . . . . . . . . . . . . . . . 30 9.2.1. ff_io_latency4 . . . . . . . . . . . . . . . . . . . 30 9.2.2. ff_layoutupdate4 . . . . . . . . . . . . . . . . . . 
31 - 9.2.3. ff_iostats4 . . . . . . . . . . . . . . . . . . . . . 31 + 9.2.3. ff_iostats4 . . . . . . . . . . . . . . . . . . . . . 32 9.3. ff_layoutreturn4 . . . . . . . . . . . . . . . . . . . . 33 - 10. Flexible Files Layout Type LAYOUTERROR . . . . . . . . . . . 33 - 11. Flexible Files Layout Type LAYOUTSTATS . . . . . . . . . . . 33 + 10. Flexible Files Layout Type LAYOUTERROR . . . . . . . . . . . 34 + 11. Flexible Files Layout Type LAYOUTSTATS . . . . . . . . . . . 34 12. Flexible File Layout Type Creation Hint . . . . . . . . . . . 34 - 12.1. ff_layouthint4 . . . . . . . . . . . . . . . . . . . . . 34 + 12.1. ff_layouthint4 . . . . . . . . . . . . . . . . . . . . . 35 13. Recalling a Layout . . . . . . . . . . . . . . . . . . . . . 35 13.1. CB_RECALL_ANY . . . . . . . . . . . . . . . . . . . . . 35 14. Client Fencing . . . . . . . . . . . . . . . . . . . . . . . 36 - 15. Security Considerations . . . . . . . . . . . . . . . . . . . 36 - 15.1. RPCSEC_GSS and Security Services . . . . . . . . . . . . 37 - 15.1.1. Loosely Coupled . . . . . . . . . . . . . . . . . . 37 - 15.1.2. Tightly Coupled . . . . . . . . . . . . . . . . . . 38 - 16. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 38 - 17. References . . . . . . . . . . . . . . . . . . . . . . . . . 39 - 17.1. Normative References . . . . . . . . . . . . . . . . . . 39 - 17.2. Informative References . . . . . . . . . . . . . . . . . 40 - Appendix A. Acknowledgments . . . . . . . . . . . . . . . . . . 40 - Appendix B. RFC Editor Notes . . . . . . . . . . . . . . . . . . 41 - Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 41 + 15. Security Considerations . . . . . . . . . . . . . . . . . . . 37 + 15.1. RPCSEC_GSS and Security Services . . . . . . . . . . . . 38 + 15.1.1. Loosely Coupled . . . . . . . . . . . . . . . . . . 38 + 15.1.2. Tightly Coupled . . . . . . . . . . . . . . . . . . 39 + 16. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 39 + 17. References . 
. . . . . . . . . . . . . . . . . . . . . . . . 40 + 17.1. Normative References . . . . . . . . . . . . . . . . . . 40 + 17.2. Informative References . . . . . . . . . . . . . . . . . 41 + Appendix A. Acknowledgments . . . . . . . . . . . . . . . . . . 41 + Appendix B. RFC Editor Notes . . . . . . . . . . . . . . . . . . 42 + Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 42 1. Introduction In the parallel Network File System (pNFS), the metadata server returns layout type structures that describe where file data is located. There are different layout types for different storage systems and methods of arranging data on storage devices. This document defines the flexible file layout type used with file-based data servers that are accessed using the Network File System (NFS) protocols: NFSv3 [RFC1813], NFSv4.0 [RFC7530], NFSv4.1 [RFC5661], and NFSv4.2 [RFC7862]. To provide a global state model equivalent to that of the files layout type, a back-end control protocol might be implemented between the metadata server and NFSv4.1+ storage devices. This document does not provide a standard track control protocol. An implementation can either define its own mechanism or it could define a control protocol in a standard's track document. The requirements for a control protocol are specified in [RFC5661] and clarified in [pNFSLayouts]. + The control protocol described in this document is based on NFS. The + storage devices are configured such that the metadata server has full + access rights to the data file system and then the metadata server + uses synthetic ids to control client access to individual files. + + In traditional mirroring of data, the server is responsible for + replicating, validating, and repairing copies of the data file. With + client-side mirroring, the metadata server provides a layout which + presents the available mirrors to the client. 
It is then the client + which picks a mirror to read from and ensures that all writes go to + all mirrors. Only if all mirrors are successfully updated, does the + client consider the write transaction to have succeeded. In case of + error, the client can use the LAYOUTERROR operation to inform the + metadata server, which is then responsible for the repairing of the + mirrored copies of the file. + 1.1. Definitions control communication requirements: are for a layout type the details regarding information on layouts, stateids, file metadata, and file data which must be communicated between the metadata server and the storage devices. control protocol: is the particular mechanism that an implementation of a layout type would use to meet the control communication requirement for that layout type. This need not be a protocol as @@ -155,20 +171,23 @@ data server (DS): is another term for storage device. fencing: is the process by which the metadata server prevents the storage devices from processing I/O from a specific client to a specific file. file layout type: is a layout type in which the storage devices are accessed via the NFS protocol (see Section 13 of [RFC5661]). + gid: is the group id, a numeric value which identifies to which + group a file belongs. + layout: is the information a client uses to access file data on a storage device. This information will include specification of the protocol (layout type) and the identity of the storage devices to be used. layout iomode: is a grant of either read or read/write I/O to the client. layout segment: is a sub-division of a layout. That sub-division might be by the layout iomode (see Sections 3.3.20 and 12.2.9 of @@ -230,35 +250,41 @@ storage protocol: is the protocol used by clients to do I/O operations to the storage device. Each layout type specifies the set of storage protocols. tight coupling: is an arrangement in which the control protocol is one designed specifically for that purpose. 
It may be either a proprietary protocol, adapted specifically to a particular metadata server, or one based on a standards-track document. + uid: is the user id, a numeric value which identifies which user + owns a file. + wsize: is the data transfer buffer size used for writes. 1.2. Requirements Language The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in [RFC2119]. 2. Coupling of Storage Devices A server implementation may choose either a loose or tight coupling - model between the metadata server and the storage devices. To - implement the tight coupling model, a control protocol has to be - defined. As the flex file layout imposes no special requirements on - the client, the control protocol will need to provide: + model between the metadata server and the storage devices. + [pNFSLayouts] describes the general problems facing pNFS + implementations. This document details how the new Flexible File + Layout Type addresses these issues. To implement the tight coupling + model, a control protocol has to be defined. As the flex file layout + imposes no special requirements on the client, the control protocol + will need to provide: (1) for the management of both security and LAYOUTCOMMITs, and, (2) a global stateid model and management of these stateids. When implementing the loose coupling model, the only control protocol will be a version of NFS, with no ability to provide a global stateid model or to prevent clients from using layouts inappropriately. To enable client use in that environment, this document will specify how security, state, and locking are to be managed. @@ -277,46 +303,51 @@ about the changes to the file. If any WRITE to a storage device did not result with stable_how equal to FILE_SYNC, a LAYOUTCOMMIT to the metadata server MUST be preceded by a COMMIT to the storage devices written to. 
Note that if the client has not done a COMMIT to the storage device, then the LAYOUTCOMMIT might not be synchronized to the last WRITE operation to the storage device. 2.2. Fencing Clients from the Storage Device With loosely coupled storage devices, the metadata server uses - synthetic uids and gids for the data file, where the uid owner of the - data file is allowed read/write access and the gid owner is allowed - read only access. As part of the layout (see ffds_user and - ffds_group in Section 5.1), the client is provided with the user and - group to be used in the Remote Procedure Call (RPC) [RFC5531] - credentials needed to access the data file. Fencing off of clients - is achieved by the metadata server changing the synthetic uid and/or - gid owners of the data file on the storage device to implicitly - revoke the outstanding RPC credentials. A client presenting the - wrong credential for the desired access will get a NFS4ERR_ACCESS - error. + synthetic uids (user ids) and gids (group ids) for the data file, + where the uid owner of the data file is allowed read/write access and + the gid owner is allowed read only access. As part of the layout + (see ffds_user and ffds_group in Section 5.1), the client is provided + with the user and group to be used in the Remote Procedure Call (RPC) + [RFC5531] credentials needed to access the data file. Fencing off of + clients is achieved by the metadata server changing the synthetic uid + and/or gid owners of the data file on the storage device to + implicitly revoke the outstanding RPC credentials. A client + presenting the wrong credential for the desired access will get a + NFS4ERR_ACCESS error. With this loosely coupled model, the metadata server is not able to fence off a single client, it is forced to fence off all clients. However, as the other clients react to the fencing, returning their layouts and trying to get new ones, the metadata server can hand out a new uid and gid to allow access. 
- Note: it is recommended to implement common access control methods at - the storage device filesystem to allow only the metadata server root + It is RECOMMENDED to implement common access control methods at the + storage device filesystem to allow only the metadata server root (super user) access to the storage device, and to set the owner of all directories holding data files to the root user. This approach provides a practical model to enforce access control and fence off cooperative clients, but it can not protect against malicious clients; hence it provides a level of security equivalent to - AUTH_SYS. + AUTH_SYS. It is RECOMMENDED that the communication between the + metadata server and storage device be secure from eavesdroppers and + man-in-the-middle protocol tampering. The security measure could be + due to physical security (e.g., the servers are co-located in a + physically secure area), from encrypted communications, or some other + technique. With tightly coupled storage devices, the metadata server sets the user and group owners, mode bits, and ACL of the data file to be the same as the metadata file. And the client must authenticate with the storage device and go through the same authorization process it would go through via the metadata server. In the case of tight coupling, fencing is the responsibility of the control protocol and is not described in detail here. However, implementations of the tight coupling locking model (see Section 2.3), will need a way to prevent access by certain clients to specific files by invalidating the @@ -364,31 +395,38 @@ 2.2.2. Example of using Synthetic uids/gids The user loghyr creates a file "ompha.c" on the metadata server and it creates a corresponding data file on the storage device. 
The metadata server entry may look like: -rw-r--r-- 1 loghyr staff 1697 Dec 4 11:31 ompha.c - On the storage device, it may be assigned some random synthetic uid/ - gid to deny access: + On the storage device, it may be assigned some unpredictable + synthetic uid/gid to deny access: -rw-r----- 1 19452 28418 1697 Dec 4 11:31 data_ompha.c - When the file is opened on a client, since the layout knows nothing - about the user (and does not care), whether loghyr or garbo opens the - file does not matter. The owner and group are modified and those - values are returned. + When the file is opened on a client and accessed, it will try to get + a layout for the data file. Since the layout knows nothing about the + user (and does not care), whether the user loghyr or garbo opens the + file does not matter. The client has to present an uid of 19452 to + get write permission. If it presents any other value for the uid, + then it must give a gid of 28418 to get read access. - -rw-r----- 1 1066 1067 1697 Dec 4 11:31 data_ompha.c + Further, if the metadata server decides to fence the file, it should + change the uid and/or gid such that these values neither match + earlier values for that file nor match a predictable change based on + an earlier fencing. + + -rw-r----- 1 19453 28419 1697 Dec 4 11:31 data_ompha.c The set of synthetic gids on the storage device should be selected such that there is no mapping in any of the name services used by the storage device. I.e., each group should have no members. If the layout segment has an iomode of LAYOUTIOMODE4_READ, then the metadata server should return a synthetic uid that is not set on the storage device. Only the synthetic gid would be valid. The client is thus solely responsible for enforcing file permissions @@ -1192,21 +1231,21 @@ this case, the storage device MUST ensure that all copies of the mirror are updated when any one of the mirrors is updated. 
If the storage device gets an error when updating one of the mirrors, then it MUST inform the client that the original WRITE had an error. The client then MUST inform the metadata server (see Section 8.2.3). The client's responsibility with respect to COMMIT is explained in Section 8.2.4. The client may choose any one of the mirrors and may use ffds_efficiency in the same manner as for reading when making this choice. -8.2.2. Single Storage Device Updates Mirrors +8.2.2. Client Updates All Mirrors If the FF_FLAGS_WRITE_ONE_MIRROR flag in ffl_flags is not set, the client is responsible for updating all mirrored copies of the layout segments that it is given in the layout. A single failed update is sufficient to fail the entire operation. If all but one copy is updated successfully and the last one provides an error, then the client needs to inform the metadata server about the error via either LAYOUTRETURN or LAYOUTERROR that the update failed to that storage device. If the client is updating the mirrors serially, then it SHOULD stop at the first error encountered and report that to the @@ -1697,20 +1736,45 @@ server verifies the permissions and ACL for these credentials, possibly returning NFS4ERR_ACCESS if the client is not allowed the requested iomode. If the LAYOUTGET operation succeeds the client receives, as part of the layout, a set of credentials allowing it I/O access to the specified data files corresponding to the requested iomode. When the client acts on I/O operations on behalf of its local users, it MUST authenticate and authorize the user by issuing respective OPEN and ACCESS calls to the metadata server, similar to having NFSv4 data delegations. + The combination of file handle, synthetic uid, and gid in the layout + are the way that the metadata server enforces access control to the + data server. The directory namespace on the storage device SHOULD + only be accessible to the metadata server and not the clients. 
In + that case, the client only has access to file handles of file objects + and not directory objects. Thus, given a file handle in a layout, it + is not possible to guess the parent directory file handle. Further, + as the data file permissions only allow the given synthetic uid read/ + write permission and the given synthetic gid read permission, knowing + the synthetic ids of one file does not necessarily allow access to + any other data file on the storage device. + + The metadata server can also deny access at any time by fencing the + data file, which means changing the synthetic ids. In turn, that + forces the client to return its current layout and get a new layout + if it wants to continue I/O to the data file. + + If the configuration of the storage device is such that clients can + access the directory namespace, then the access control degrades to + that of a typical NFS server with exports with a security flavor of + AUTH_SYS. Any client which is allowed access can forge credentials + to access any data file. The caveat is that the rogue client might + have no knowledge of the data file's type or position in the metadata + directory namespace. + If access is allowed, the client uses the corresponding (READ or RW) credentials to perform the I/O operations at the data file's storage devices. When the metadata server receives a request to change a file's permissions or ACL, it SHOULD recall all layouts for that file and then MUST fence off any clients still holding outstanding layouts for the respective files by implicitly invalidating the previously distributed credential on all data files comprising the file in question. It is REQUIRED that this be done before committing to the new permissions and/or ACL. By requesting new layouts, the clients will reauthorize access against the modified access control metadata.
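
The loosely coupled access model described in Section 2.2 (the synthetic uid owner gets read/write, the synthetic gid owner gets read only, and fencing changes the owners) can be illustrated with a small sketch. This is not part of the draft and not an implementation of any real server: the names DataFile, check_access, and fence are hypothetical, the check is a simplified stand-in for the storage device's ordinary mode-bit evaluation of a file with mode -rw-r-----, and the id values are simply the ones used in the example in Section 2.2.2.

```python
from dataclasses import dataclass

NFS4_OK = 0
NFS4ERR_ACCESS = 13  # numeric value as defined for NFSv4


@dataclass
class DataFile:
    """Hypothetical model of a data file with mode -rw-r----- on a storage device."""
    uid: int  # synthetic uid owner: read/write access
    gid: int  # synthetic gid owner: read-only access


def check_access(f: DataFile, cred_uid: int, cred_gid: int, want_write: bool) -> int:
    """Simplified mode-bit check the storage device would apply to an RPC credential."""
    if cred_uid == f.uid:
        return NFS4_OK  # owner bits grant both read and write
    if cred_gid == f.gid and not want_write:
        return NFS4_OK  # group bits grant read only
    return NFS4ERR_ACCESS


def fence(f: DataFile, new_uid: int, new_gid: int) -> None:
    """Assumed metadata-server action: changing the owners implicitly
    revokes every credential handed out in earlier layouts."""
    f.uid, f.gid = new_uid, new_gid


data_file = DataFile(uid=19452, gid=28418)

# A read/write layout carries uid 19452; a read-only layout carries only the valid gid.
rw_before = check_access(data_file, 19452, 0, want_write=True)
ro_before = check_access(data_file, 1, 28418, want_write=False)
ro_write_attempt = check_access(data_file, 1, 28418, want_write=True)

# After fencing, the old credentials no longer match for any client.
fence(data_file, 19453, 28419)
rw_after = check_access(data_file, 19452, 28418, want_write=True)
ro_after = check_access(data_file, 19452, 28418, want_write=False)
```

Because the check depends only on the ids stored with the data file, changing them affects every client holding the old credentials, which is why the loosely coupled model cannot fence a single client. Clients that return their layouts and request new ones would receive the new ids.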