draft-ietf-nfsv4-rfc5664bis-02.txt   draft-ietf-nfsv4-rfc5664bis-03.txt 
NFSv4 B. Halevy NFSv4 B. Halevy
Internet-Draft PrimaryData Internet-Draft PrimaryData
Intended status: Standards Track B. Harrosh Intended status: Standards Track B. Harrosh
Expires: April 8, 2014 B. Welch Expires: November 2, 2014 B. Welch
B. Mueller
Panasas Panasas
October 05, 2013 May 01, 2014
Object-Based Parallel NFS (pNFS) Operations Object-Based Parallel NFS (pNFS) Operations
draft-ietf-nfsv4-rfc5664bis-02 draft-ietf-nfsv4-rfc5664bis-03
Abstract Abstract
Parallel NFS (pNFS) extends Network File System version 4 (NFSv4) to Parallel NFS (pNFS) extends Network File System version 4 (NFSv4) to
allow clients to directly access file data on the storage used by the allow clients to directly access file data on the storage used by the
NFSv4 server. This ability to bypass the server for data access can NFSv4 server. This ability to bypass the server for data access can
increase both performance and parallelism, but requires additional increase both performance and parallelism, but requires additional
client functionality for data access, some of which is dependent on client functionality for data access, some of which is dependent on
the class of storage used, a.k.a. the Layout Type. The main pNFS the class of storage used, a.k.a. the Layout Type. The main pNFS
operations and data types in NFSv4 Minor version 1 specify a layout- operations and data types in NFSv4 Minor version 1 specify a layout-
type-independent layer; layout-type-specific information is conveyed type-independent layer; layout-type-specific information is conveyed
using opaque data structures whose internal structure is further using opaque data structures whose internal structure is further
defined by the particular layout type specification. This document defined by the particular layout type specification. This document
specifies the NFSv4.1 Object-Based pNFS Layout Type as a companion to specifies the NFSv4.1 Object-Based pNFS Layout Type as a companion to
the main NFSv4 Minor version 1 specification. This document has been the main NFSv4 Minor version 1 specification. This document has been
updated since the initial version to clarify and fix some of the updated since the initial version to clarify and fix some of the
RAID-related computations so they match current implementations. RAID-related computations so they match current implementations.
Status of this Memo Status of This Memo
This Internet-Draft is submitted in full conformance with the This Internet-Draft is submitted in full conformance with the
provisions of BCP 78 and BCP 79. provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF). Note that other groups may also distribute Task Force (IETF). Note that other groups may also distribute
working documents as Internet-Drafts. The list of current Internet- working documents as Internet-Drafts. The list of current Internet-
Drafts is at http://datatracker.ietf.org/drafts/current/. Drafts is at http://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress." material or to cite them other than as "work in progress."
This Internet-Draft will expire on April 8, 2014. This Internet-Draft will expire on November 2, 2014.
Copyright Notice Copyright Notice
Copyright (c) 2013 IETF Trust and the persons identified as the Copyright (c) 2014 IETF Trust and the persons identified as the
document authors. All rights reserved. document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents Provisions Relating to IETF Documents
(http://trustee.ietf.org/license-info) in effect on the date of (http://trustee.ietf.org/license-info) in effect on the date of
publication of this document. Please review these documents publication of this document. Please review these documents
carefully, as they describe your rights and restrictions with respect carefully, as they describe your rights and restrictions with respect
to this document. Code Components extracted from this document must to this document. Code Components extracted from this document must
include Simplified BSD License text as described in Section 4.e of include Simplified BSD License text as described in Section 4.e of
the Trust Legal Provisions and are provided without warranty as the Trust Legal Provisions and are provided without warranty as
described in the Simplified BSD License. described in the Simplified BSD License.
Table of Contents Table of Contents
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 4 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 3
1.1. Requirements Language . . . . . . . . . . . . . . . . . . 4 1.1. Requirements Language . . . . . . . . . . . . . . . . . . 4
1.2. Overview of Changes . . . . . . . . . . . . . . . . . . . 4 1.2. Overview of Changes . . . . . . . . . . . . . . . . . . . 4
2. XDR Description of the Objects-Based Layout Protocol . . . . . 4 2. XDR Description of the Objects-Based Layout Protocol . . . . 4
2.1. Code Components Licensing Notice . . . . . . . . . . . . . 5 2.1. Code Components Licensing Notice . . . . . . . . . . . . 5
3. Basic Data Type Definitions . . . . . . . . . . . . . . . . . 6 3. Basic Data Type Definitions . . . . . . . . . . . . . . . . . 6
3.1. pnfs_osd_objid4 . . . . . . . . . . . . . . . . . . . . . 6 3.1. pnfs_osd_objid4 . . . . . . . . . . . . . . . . . . . . . 6
3.2. pnfs_osd_version4 . . . . . . . . . . . . . . . . . . . . 7 3.2. pnfs_osd_version4 . . . . . . . . . . . . . . . . . . . . 6
3.3. pnfs_osd_object_cred4 . . . . . . . . . . . . . . . . . . 8 3.3. pnfs_osd_object_cred4 . . . . . . . . . . . . . . . . . . 7
3.4. pnfs_osd_raid_algorithm4 . . . . . . . . . . . . . . . . . 9 3.4. pnfs_osd_raid_algorithm4 . . . . . . . . . . . . . . . . 8
4. Object Storage Device Addressing and Discovery . . . . . . . . 9 4. Object Storage Device Addressing and Discovery . . . . . . . 9
4.1. pnfs_osd_targetid_type4 . . . . . . . . . . . . . . . . . 10 4.1. pnfs_osd_targetid_type4 . . . . . . . . . . . . . . . . . 10
4.2. pnfs_osd_deviceaddr4 . . . . . . . . . . . . . . . . . . . 11 4.2. pnfs_osd_deviceaddr4 . . . . . . . . . . . . . . . . . . 10
4.2.1. SCSI Target Identifier . . . . . . . . . . . . . . . . 11 4.2.1. SCSI Target Identifier . . . . . . . . . . . . . . . 11
4.2.2. Device Network Address . . . . . . . . . . . . . . . . 12 4.2.2. Device Network Address . . . . . . . . . . . . . . . 12
5. Object-Based Layout . . . . . . . . . . . . . . . . . . . . . 13 5. Object-Based Layout . . . . . . . . . . . . . . . . . . . . . 12
5.1. pnfs_osd_data_map4 . . . . . . . . . . . . . . . . . . . . 13 5.1. pnfs_osd_data_map4 . . . . . . . . . . . . . . . . . . . 13
5.2. pnfs_osd_layout4 . . . . . . . . . . . . . . . . . . . . . 14 5.2. pnfs_osd_layout4 . . . . . . . . . . . . . . . . . . . . 14
5.3. Data Mapping Schemes . . . . . . . . . . . . . . . . . . . 15 5.3. Data Mapping Schemes . . . . . . . . . . . . . . . . . . 15
5.3.1. Simple Striping . . . . . . . . . . . . . . . . . . . 15 5.3.1. Simple Striping . . . . . . . . . . . . . . . . . . . 15
5.3.2. Nested Striping . . . . . . . . . . . . . . . . . . . 16 5.3.2. Nested Striping . . . . . . . . . . . . . . . . . . . 16
5.3.3. Mirroring . . . . . . . . . . . . . . . . . . . . . . 18 5.3.3. Mirroring . . . . . . . . . . . . . . . . . . . . . . 18
5.4. RAID Algorithms . . . . . . . . . . . . . . . . . . . . . 19 5.4. RAID Algorithms . . . . . . . . . . . . . . . . . . . . . 19
5.4.1. PNFS_OSD_RAID_0 . . . . . . . . . . . . . . . . . . . 19 5.4.1. PNFS_OSD_RAID_0 . . . . . . . . . . . . . . . . . . . 19
5.4.2. PNFS_OSD_RAID_4 . . . . . . . . . . . . . . . . . . . 19 5.4.2. PNFS_OSD_RAID_4 . . . . . . . . . . . . . . . . . . . 20
5.4.3. PNFS_OSD_RAID_5 . . . . . . . . . . . . . . . . . . . 20 5.4.3. PNFS_OSD_RAID_5 . . . . . . . . . . . . . . . . . . . 20
5.4.4. PNFS_OSD_RAID_PQ . . . . . . . . . . . . . . . . . . . 21 5.4.4. PNFS_OSD_RAID_PQ . . . . . . . . . . . . . . . . . . 21
5.4.5. RAID Usage and Implementation Notes . . . . . . . . . 22 5.4.5. RAID Usage and Implementation Notes . . . . . . . . . 22
6. Object-Based Layout Update . . . . . . . . . . . . . . . . . . 22 6. Object-Based Layout Update . . . . . . . . . . . . . . . . . 22
6.1. pnfs_osd_deltaspaceused4 . . . . . . . . . . . . . . . . . 23 6.1. pnfs_osd_deltaspaceused4 . . . . . . . . . . . . . . . . 23
6.2. pnfs_osd_layoutupdate4 . . . . . . . . . . . . . . . . . . 23 6.2. pnfs_osd_layoutupdate4 . . . . . . . . . . . . . . . . . 23
7. Recovering from Client I/O Errors . . . . . . . . . . . . . . 24 7. Recovering from Client I/O Errors . . . . . . . . . . . . . . 24
8. Object-Based Layout Return . . . . . . . . . . . . . . . . . . 24 8. Object-Based Layout Return . . . . . . . . . . . . . . . . . 24
8.1. pnfs_osd_errno4 . . . . . . . . . . . . . . . . . . . . . 25 8.1. pnfs_osd_errno4 . . . . . . . . . . . . . . . . . . . . . 25
8.2. pnfs_osd_ioerr4 . . . . . . . . . . . . . . . . . . . . . 26 8.2. pnfs_osd_ioerr4 . . . . . . . . . . . . . . . . . . . . . 26
8.3. pnfs_osd_layoutreturn4 . . . . . . . . . . . . . . . . . . 27 8.3. pnfs_osd_layoutreturn4 . . . . . . . . . . . . . . . . . 27
9. Object-Based Creation Layout Hint . . . . . . . . . . . . . . 27 9. Object-Based Creation Layout Hint . . . . . . . . . . . . . . 27
9.1. pnfs_osd_layouthint4 . . . . . . . . . . . . . . . . . . . 27 9.1. pnfs_osd_layouthint4 . . . . . . . . . . . . . . . . . . 27
10. Layout Segments . . . . . . . . . . . . . . . . . . . . . . . 29 10. Layout Segments . . . . . . . . . . . . . . . . . . . . . . . 29
10.1. CB_LAYOUTRECALL and LAYOUTRETURN . . . . . . . . . . . . . 29 10.1. CB_LAYOUTRECALL and LAYOUTRETURN . . . . . . . . . . . . 29
10.2. LAYOUTCOMMIT . . . . . . . . . . . . . . . . . . . . . . . 30 10.2. LAYOUTCOMMIT . . . . . . . . . . . . . . . . . . . . . . 30
11. Recalling Layouts . . . . . . . . . . . . . . . . . . . . . . 30 11. Recalling Layouts . . . . . . . . . . . . . . . . . . . . . . 30
11.1. CB_RECALL_ANY . . . . . . . . . . . . . . . . . . . . . . 30 11.1. CB_RECALL_ANY . . . . . . . . . . . . . . . . . . . . . 30
12. Client Fencing . . . . . . . . . . . . . . . . . . . . . . . . 31 12. Client Fencing . . . . . . . . . . . . . . . . . . . . . . . 31
13. Security Considerations . . . . . . . . . . . . . . . . . . . 31 13. Security Considerations . . . . . . . . . . . . . . . . . . . 31
13.1. OSD Security Data Types . . . . . . . . . . . . . . . . . 32 13.1. OSD Security Data Types . . . . . . . . . . . . . . . . 32
13.2. The OSD Security Protocol . . . . . . . . . . . . . . . . 33 13.2. The OSD Security Protocol . . . . . . . . . . . . . . . 33
13.3. Protocol Privacy Requirements . . . . . . . . . . . . . . 34 13.3. Protocol Privacy Requirements . . . . . . . . . . . . . 34
13.4. Revoking Capabilities . . . . . . . . . . . . . . . . . . 34 13.4. Revoking Capabilities . . . . . . . . . . . . . . . . . 34
14. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 35 14. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 35
15. References . . . . . . . . . . . . . . . . . . . . . . . . . . 35 15. References . . . . . . . . . . . . . . . . . . . . . . . . . 35
15.1. Normative References . . . . . . . . . . . . . . . . . . . 35 15.1. Normative References . . . . . . . . . . . . . . . . . . 35
15.2. Informative References . . . . . . . . . . . . . . . . . . 36 15.2. Informative References . . . . . . . . . . . . . . . . . 36
Appendix A. Acknowledgments . . . . . . . . . . . . . . . . . . . 37 Appendix A. Acknowledgments . . . . . . . . . . . . . . . . . . 37
Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 37 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 37
1. Introduction 1. Introduction
In pNFS, the file server returns typed layout structures that In pNFS, the file server returns typed layout structures that
describe where file data is located. There are different layouts for describe where file data is located. There are different layouts for
different storage systems and methods of arranging data on storage different storage systems and methods of arranging data on storage
devices. This document describes the layouts used with object-based devices. This document describes the layouts used with object-based
storage devices (OSDs) that are accessed according to the OSD storage storage devices (OSDs) that are accessed according to the OSD storage
protocol standard (ANSI INCITS 400-2004 [1]). protocol standard (ANSI INCITS 400-2004 [1]).
skipping to change at page 4, line 46 skipping to change at page 4, line 22
This document is an update to the initial RFC. The primary area for This document is an update to the initial RFC. The primary area for
changes is the clarification and correction of the RAID-related changes is the clarification and correction of the RAID-related
equations and algorithms in Section 5.3. The equations were restated equations and algorithms in Section 5.3. The equations were restated
for clarity, and in a few places minor corrections were made to for clarity, and in a few places minor corrections were made to
ensure that this spec accurately matches current implementations. In ensure that this spec accurately matches current implementations. In
addition, minor corrections have been made to other sections. addition, minor corrections have been made to other sections.
2. XDR Description of the Objects-Based Layout Protocol 2. XDR Description of the Objects-Based Layout Protocol
This document contains the external data representation (XDR [3]) This document contains the external data representation (XDR [6])
description of the NFSv4.1 objects layout protocol. The XDR description of the NFSv4.1 objects layout protocol. The XDR
description is embedded in this document in a way that makes it description is embedded in this document in a way that makes it
simple for the reader to extract into a ready-to-compile form. The simple for the reader to extract into a ready-to-compile form. The
reader can feed this document into the following shell script to reader can feed this document into the following shell script to
produce the machine-readable XDR description of the NFSv4.1 objects produce the machine-readable XDR description of the NFSv4.1 objects
layout protocol: layout protocol:
#!/bin/sh #!/bin/sh
grep '^ *///' $* | sed 's?^ */// ??' | sed 's?^ *///$??' grep '^ *///' $* | sed 's?^ */// ??' | sed 's?^ *///$??'
skipping to change at page 5, line 22 skipping to change at page 4, line 46
sh extract.sh < spec.txt > pnfs_osd_prot.x sh extract.sh < spec.txt > pnfs_osd_prot.x
The effect of the script is to remove leading white space from each The effect of the script is to remove leading white space from each
line, plus a sentinel sequence of "///". line, plus a sentinel sequence of "///".
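The same extraction can be sketched in Python for readers who prefer not to rely on a shell. This is an illustration only: the function name extract_xdr and the sample input are assumptions made for the sketch and are not part of either draft. It mirrors the grep and sed steps of the script one for one.

```python
import re


def extract_xdr(text: str) -> str:
    """Keep only lines carrying the "///" sentinel and strip the sentinel."""
    kept = []
    for line in text.splitlines():
        # grep '^ *///': keep lines that start with optional spaces and "///".
        if not re.match(r" *///", line):
            continue
        # sed 's?^ */// ??': drop leading spaces, the sentinel, and one space.
        line = re.sub(r"^ */// ", "", line, count=1)
        # sed 's?^ *///$??': a bare sentinel line becomes an empty line.
        line = re.sub(r"^ *///$", "", line, count=1)
        kept.append(line)
    return "".join(item + "\n" for item in kept)


# Hypothetical sample input (not taken from the draft).
sample = (
    "prose that is ignored\n"
    "   /// struct example {\n"
    "   ///     int a;\n"
    "   /// };\n"
    "   ///\n"
)
extracted = extract_xdr(sample)
```

Indentation that follows the sentinel and its single trailing space is preserved, so nested XDR declarations keep their layout.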
The embedded XDR file header follows. Subsequent XDR descriptions, The embedded XDR file header follows. Subsequent XDR descriptions,
with the sentinel sequence, are embedded throughout the document. with the sentinel sequence, are embedded throughout the document.
Note that the XDR code contained in this document depends on types Note that the XDR code contained in this document depends on types
from the NFSv4.1 nfs4_prot.x file ([4]). This includes both nfs from the NFSv4.1 nfs4_prot.x file ([5]). This includes both nfs
types that end with a 4, such as offset4, length4, etc., as well as types that end with a 4, such as offset4, length4, etc., as well as
more generic types such as uint32_t and uint64_t. more generic types such as uint32_t and uint64_t.
2.1. Code Components Licensing Notice 2.1. Code Components Licensing Notice
The XDR description, marked with lines beginning with the sequence The XDR description, marked with lines beginning with the sequence
"///", as well as scripts for extracting the XDR description are Code "///", as well as scripts for extracting the XDR description are Code
Components as described in Section 4 of "Legal Provisions Relating to Components as described in Section 4 of "Legal Provisions Relating to
IETF Documents" [5]. These Code Components are licensed according to IETF Documents" [3]. These Code Components are licensed according to
the terms of Section 4 of "Legal Provisions Relating to IETF the terms of Section 4 of "Legal Provisions Relating to IETF
Documents". Documents".
/// /* /// /*
/// * Copyright (c) 2010 IETF Trust and the persons identified /// * Copyright (c) 2010 IETF Trust and the persons identified
/// * as authors of the code. All rights reserved. /// * as authors of the code. All rights reserved.
/// * /// *
/// * Redistribution and use in source and binary forms, with /// * Redistribution and use in source and binary forms, with
/// * or without modification, are permitted provided that the /// * or without modification, are permitted provided that the
/// * following conditions are met: /// * following conditions are met:
skipping to change at page 6, line 25 skipping to change at page 5, line 52
/// * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, /// * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
/// * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT /// * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT
/// * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR /// * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
/// * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS /// * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
/// * INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF /// * INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
/// * LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, /// * LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
/// * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING /// * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING
/// * IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF /// * IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF
/// * ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. /// * ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
/// * /// *
/// * This code was derived from draft-ietf-nfsv4-rfc5664bis-02. /// * This code was derived from draft-ietf-nfsv4-rfc5664bis-03.
[[RFC Editor: please insert RFC number if needed]] [[RFC Editor: please insert RFC number if needed]]
/// * Please reproduce this note if possible. /// * Please reproduce this note if possible.
/// */ /// */
/// ///
/// /* /// /*
/// * pnfs_osd_prot.x /// * pnfs_osd_prot.x
/// */ /// */
/// ///
/// %#include <nfs4_prot.x> /// %#include <nfs4_prot.x>
/// ///
skipping to change at page 7, line 13 skipping to change at page 6, line 35
within an object storage device are grouped into partitions. within an object storage device are grouped into partitions.
/// struct pnfs_osd_objid4 { /// struct pnfs_osd_objid4 {
/// deviceid4 oid_device_id; /// deviceid4 oid_device_id;
/// uint64_t oid_partition_id; /// uint64_t oid_partition_id;
/// uint64_t oid_object_id; /// uint64_t oid_object_id;
/// }; /// };
/// ///
The pnfs_osd_objid4 type is used to identify an object within a The pnfs_osd_objid4 type is used to identify an object within a
partition on a specified object storage device. "oid_device_id" partition on a specified object storage device. "oid_device_id"
selects the object storage device from the set of available storage selects the object storage device from the set of available storage
devices. The device is identified with the deviceid4 type, which is devices. The device is identified with the deviceid4 type, which is
an index into addressing information about that device returned by an index into addressing information about that device returned by
the GETDEVICELIST and GETDEVICEINFO operations. The deviceid4 data the GETDEVICELIST and GETDEVICEINFO operations. The deviceid4 data
type is defined in NFSv4.1 [6]. Within an OSD, a partition is type is defined in NFSv4.1 [4]. Within an OSD, a partition is
identified with a 64-bit number, "oid_partition_id". Within a identified with a 64-bit number, "oid_partition_id". Within a
partition, an object is identified with a 64-bit number, partition, an object is identified with a 64-bit number,
"oid_object_id". Creation and management of partitions is outside "oid_object_id". Creation and management of partitions is outside
the scope of this document, and is a facility provided by the object- the scope of this document, and is a facility provided by the object-
based storage file system. based storage file system.
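As a rough sketch of how the three fields sit on the wire, the following Python packs and unpacks a pnfs_osd_objid4 value. It assumes that deviceid4 is the 16-byte fixed-length opaque defined by NFSv4.1 and that the two 64-bit fields use the usual big-endian XDR encoding; the helper names and sample values are hypothetical.

```python
import struct

NFS4_DEVICEID4_SIZE = 16  # assumed from the NFSv4.1 definition of deviceid4

# Big-endian layout: 16-byte fixed opaque, then two unsigned 64-bit integers.
_OBJID4 = struct.Struct(">16sQQ")


def encode_objid4(device_id: bytes, partition_id: int, object_id: int) -> bytes:
    """Pack oid_device_id, oid_partition_id, and oid_object_id in order."""
    if len(device_id) != NFS4_DEVICEID4_SIZE:
        raise ValueError("deviceid4 must be exactly 16 bytes")
    return _OBJID4.pack(device_id, partition_id, object_id)


def decode_objid4(data: bytes):
    """Return (oid_device_id, oid_partition_id, oid_object_id)."""
    return _OBJID4.unpack(data)


# Hypothetical identifiers chosen only for illustration.
encoded = encode_objid4(bytes(range(16)), 0x10000, 0x10005)
decoded = decode_objid4(encoded)
```

Because every field has a fixed size that is a multiple of four bytes, no XDR padding is needed and the encoded form is always 32 bytes.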
3.2. pnfs_osd_version4 3.2. pnfs_osd_version4
/// enum pnfs_osd_version4 { /// enum pnfs_osd_version4 {
/// PNFS_OSD_MISSING = 0, /// PNFS_OSD_MISSING = 0,
/// PNFS_OSD_VERSION_1 = 1, /// PNFS_OSD_VERSION_1 = 1,
/// PNFS_OSD_VERSION_2 = 2 /// PNFS_OSD_VERSION_2 = 2
/// }; /// };
/// ///
pnfs_osd_version4 is used to indicate the OSD protocol version used pnfs_osd_version4 is used to indicate the OSD protocol version used
to access an object, or whether an object is missing (i.e., to access an object, or whether an object is missing (i.e.,
unavailable). Some of the RAID algorithms supported by object-based unavailable). Some of the RAID algorithms supported by object-based
skipping to change at page 10, line 15 skipping to change at page 9, line 45
In some situations, SCSI target discovery may need to be driven based In some situations, SCSI target discovery may need to be driven based
on information contained in the GETDEVICEINFO response. One example on information contained in the GETDEVICEINFO response. One example
of this is Internet SCSI (iSCSI) targets that are not known to the of this is Internet SCSI (iSCSI) targets that are not known to the
client until a layout has been requested. The information provided client until a layout has been requested. The information provided
as the "oda_targetid", "oda_targetaddr", and "oda_lun" fields in the as the "oda_targetid", "oda_targetaddr", and "oda_lun" fields in the
pnfs_osd_deviceaddr4 type described below (see Section 4.2) allows pnfs_osd_deviceaddr4 type described below (see Section 4.2) allows
the client to probe a specific device given its network address and the client to probe a specific device given its network address and
optionally its iSCSI Name (see iSCSI [8]), or when the device network optionally its iSCSI Name (see iSCSI [8]), or when the device network
address is omitted, allows it to discover the object storage device address is omitted, allows it to discover the object storage device
using the provided device name or SCSI Device Identifier (see SPC-3 using the provided device name or SCSI Device Identifier (see SPC-3
[9]). [10]).
The oda_systemid is implicitly used by the client, by using the The oda_systemid is implicitly used by the client, by using the
object credential signing key to sign each request with the request object credential signing key to sign each request with the request
integrity check value. This method protects the client from integrity check value. This method protects the client from
unintentionally accessing a device if the device address mapping was unintentionally accessing a device if the device address mapping was
changed (or revoked). The server computes the capability key using changed (or revoked). The server computes the capability key using
its own view of the systemid associated with the respective deviceid its own view of the systemid associated with the respective deviceid
present in the credential. If the client's view of the deviceid present in the credential. If the client's view of the deviceid
mapping is stale, the client will use the wrong systemid (which must mapping is stale, the client will use the wrong systemid (which must
be system-wide unique) and the I/O request to the OSD will fail to be system-wide unique) and the I/O request to the OSD will fail to
skipping to change at page 11, line 10 skipping to change at page 10, line 41
/// OBJ_TARGET_SCSI_NAME = 2, /// OBJ_TARGET_SCSI_NAME = 2,
/// OBJ_TARGET_SCSI_DEVICE_ID = 3 /// OBJ_TARGET_SCSI_DEVICE_ID = 3
/// }; /// };
/// ///
4.2. pnfs_osd_deviceaddr4 4.2. pnfs_osd_deviceaddr4
The "pnfs_osd_deviceaddr4" data structure is returned by the server The "pnfs_osd_deviceaddr4" data structure is returned by the server
as the storage-protocol-specific opaque field da_addr_body in the as the storage-protocol-specific opaque field da_addr_body in the
"device_addr4" structure by a successful GETDEVICEINFO operation "device_addr4" structure by a successful GETDEVICEINFO operation
(NFSv4.1 [6]). (NFSv4.1 [4]).
The specification for an object device address is as follows: The specification for an object device address is as follows:
/// union pnfs_osd_targetid4 switch (pnfs_osd_targetid_type4 oti_type) { /// union pnfs_osd_targetid4 switch (pnfs_osd_targetid_type4 oti_type) {
/// case OBJ_TARGET_SCSI_NAME: /// case OBJ_TARGET_SCSI_NAME:
/// string oti_scsi_name<>; /// string oti_scsi_name<>;
/// ///
/// case OBJ_TARGET_SCSI_DEVICE_ID: /// case OBJ_TARGET_SCSI_DEVICE_ID:
/// opaque oti_scsi_device_id<>; /// opaque oti_scsi_device_id<>;
/// ///
skipping to change at page 11, line 46 skipping to change at page 11, line 37
/// opaque oda_systemid<>; /// opaque oda_systemid<>;
/// pnfs_osd_object_cred4 oda_root_obj_cred; /// pnfs_osd_object_cred4 oda_root_obj_cred;
/// opaque oda_osdname<>; /// opaque oda_osdname<>;
/// }; /// };
/// ///
4.2.1. SCSI Target Identifier 4.2.1. SCSI Target Identifier
When "oda_targetid" is specified as an OBJ_TARGET_SCSI_NAME, the When "oda_targetid" is specified as an OBJ_TARGET_SCSI_NAME, the
"oti_scsi_name" string MUST be formatted as an "iSCSI Name" as "oti_scsi_name" string MUST be formatted as an "iSCSI Name" as
specified in iSCSI [8] and [10]. Note that the specification of the specified in iSCSI [8] and [9]. Note that the specification of the
oti_scsi_name string format is outside the scope of this document. oti_scsi_name string format is outside the scope of this document.
Parsing the string is based on the string prefix, e.g., "iqn.", Parsing the string is based on the string prefix, e.g., "iqn.",
"eui.", or "naa.", and more formats MAY be specified in the future in "eui.", or "naa.", and more formats MAY be specified in the future in
accordance with iSCSI Names properties. accordance with iSCSI Names properties.
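Prefix-based parsing of "oti_scsi_name" might look like the following Python sketch. Only the three prefixes come from the text; the function name, the returned labels, and the sample names are assumptions for illustration. Unrecognized prefixes are reported rather than rejected, since more formats may be specified later.

```python
# Prefixes named in the text; the labels on the right are illustrative only.
_PREFIXES = {"iqn.": "IQN", "eui.": "EUI", "naa.": "NAA"}


def classify_scsi_name(name: str) -> str:
    """Return a label for the iSCSI Name format, or "unknown"."""
    for prefix, label in _PREFIXES.items():
        if name.startswith(prefix):
            return label
    return "unknown"


# Hypothetical names, not taken from the draft.
labels = [
    classify_scsi_name("iqn.2001-04.com.example:storage.osd1"),
    classify_scsi_name("eui.02004567A425678D"),
    classify_scsi_name("naa.52004567BA64678D"),
    classify_scsi_name("xyz.something-new"),
]
```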
Currently, the iSCSI Name provides for naming the target device using Currently, the iSCSI Name provides for naming the target device using
a string formatted as an iSCSI Qualified Name (IQN) or as an Extended a string formatted as an iSCSI Qualified Name (IQN) or as an Extended
Unique Identifier (EUI) [11] string. Those are typically used to Unique Identifier (EUI) [13] string. Those are typically used to
identify iSCSI or SCSI RDMA Protocol (SRP) [16] devices. The identify iSCSI or SCSI RDMA Protocol (SRP) [20] devices. The
Network Address Authority (NAA) string format (see [10]) provides for Network Address Authority (NAA) string format (see [9]) provides for
naming the device using globally unique identifiers, as defined in naming the device using globally unique identifiers, as defined in
Fibre Channel Framing and Signaling (FC-FS) [17]. These are Fibre Channel Framing and Signaling (FC-FS) [21]. These are
typically used to identify Fibre Channel or SAS [18] (Serial Attached typically used to identify Fibre Channel or SAS [22] (Serial Attached
SCSI) devices. This applies in particular to devices that are SCSI) devices. This applies in particular to devices that are
dual-attached, both over Fibre Channel or SAS and over iSCSI. dual-attached, both over Fibre Channel or SAS and over iSCSI.
When "oda_targetid" is specified as an OBJ_TARGET_SCSI_DEVICE_ID, the When "oda_targetid" is specified as an OBJ_TARGET_SCSI_DEVICE_ID, the
"oti_scsi_device_id" opaque field MUST be formatted as a SCSI Device "oti_scsi_device_id" opaque field MUST be formatted as a SCSI Device
Identifier as defined in SPC-3 [9] VPD Page 83h (Section 7.6.3, Identifier as defined in SPC-3 [10] VPD Page 83h (Section 7.6.3,
"Device Identification VPD Page"). If the Device Identifier is "Device Identification VPD Page"). If the Device Identifier is
identical to the OSD System ID, as given by oda_systemid, the server identical to the OSD System ID, as given by oda_systemid, the server
SHOULD provide a zero-length oti_scsi_device_id opaque value. Note SHOULD provide a zero-length oti_scsi_device_id opaque value. Note
that similarly to the "oti_scsi_name", the specification of the that similarly to the "oti_scsi_name", the specification of the
oti_scsi_device_id opaque contents is outside the scope of this oti_scsi_device_id opaque contents is outside the scope of this
document and more formats MAY be specified in the future in document and more formats MAY be specified in the future in
accordance with SPC-3. accordance with SPC-3.
The OBJ_TARGET_ANON pnfs_osd_targetid_type4 MAY be used for providing The OBJ_TARGET_ANON pnfs_osd_targetid_type4 MAY be used for providing
no target identification. In this case, only the OSD System ID, and no target identification. In this case, only the OSD System ID, and
optionally the provided network address, are used to locate the optionally the provided network address, are used to locate the
device. device.
4.2.2. Device Network Address 4.2.2. Device Network Address
The optional "oda_targetaddr" field MAY be provided by the server as The optional "oda_targetaddr" field MAY be provided by the server as
a hint to accelerate device discovery over, e.g., the iSCSI transport a hint to accelerate device discovery over, e.g., the iSCSI transport
protocol. The network address is given with the netaddr4 type, which protocol. The network address is given with the netaddr4 type, which
specifies a TCP/IP based endpoint (as specified in NFSv4.1 [6]). specifies a TCP/IP based endpoint (as specified in NFSv4.1 [4]).
When given, the client SHOULD use it to probe for the SCSI device at When given, the client SHOULD use it to probe for the SCSI device at
the given network address. The client MAY still use other discovery the given network address. The client MAY still use other discovery
mechanisms such as Internet Storage Name Service (iSNS) [12] to mechanisms such as Internet Storage Name Service (iSNS) [12] to
locate the device using the oda_targetid. In particular, such an locate the device using the oda_targetid. In particular, such an
external name service SHOULD be used when the devices may be attached external name service SHOULD be used when the devices may be attached
to the network using multiple connections, and/or multiple storage to the network using multiple connections, and/or multiple storage
fabrics (e.g., Fibre-Channel and iSCSI). fabrics (e.g., Fibre-Channel and iSCSI).
The "oda_lun" field identifies the OSD 64-bit Logical Unit Number, The "oda_lun" field identifies the OSD 64-bit Logical Unit Number,
formatted in accordance with SAM-3 [13]. The client uses the Logical formatted in accordance with SAM-3 [11]. The client uses the Logical
Unit Number to communicate with the specific OSD Logical Unit. Its Unit Number to communicate with the specific OSD Logical Unit. Its
use is defined in detail by the SCSI transport protocol, e.g., iSCSI use is defined in detail by the SCSI transport protocol, e.g., iSCSI
[8]. [8].
5. Object-Based Layout 5. Object-Based Layout
The layout4 type is defined in the NFSv4.1 [6] as follows: The layout4 type is defined in the NFSv4.1 [4] as follows:
enum layouttype4 { enum layouttype4 {
LAYOUT4_NFSV4_1_FILES = 1, LAYOUT4_NFSV4_1_FILES = 1,
LAYOUT4_OSD2_OBJECTS = 2, LAYOUT4_OSD2_OBJECTS = 2,
LAYOUT4_BLOCK_VOLUME = 3 LAYOUT4_BLOCK_VOLUME = 3
}; };
struct layout_content4 { struct layout_content4 {
layouttype4 loc_type; layouttype4 loc_type;
opaque loc_body<>; opaque loc_body<>;
}; };
struct layout4 { struct layout4 {
offset4 lo_offset; offset4 lo_offset;
length4 lo_length; length4 lo_length;
layoutiomode4 lo_iomode; layoutiomode4 lo_iomode;
layout_content4 lo_content; layout_content4 lo_content;
}; };
This document defines the structure associated with the layouttype4 This document defines the structure associated with the layouttype4
value, LAYOUT4_OSD2_OBJECTS. The NFSv4.1 [6] specifies the loc_body value, LAYOUT4_OSD2_OBJECTS. The NFSv4.1 [4] specifies the loc_body
structure as an XDR type "opaque". The opaque layout is structure as an XDR type "opaque". The opaque layout is
uninterpreted by the generic pNFS client layers, but obviously must uninterpreted by the generic pNFS client layers, but obviously must
be interpreted by the object storage layout driver. This section be interpreted by the object storage layout driver. This section
defines the structure of this opaque value, pnfs_osd_layout4. defines the structure of this opaque value, pnfs_osd_layout4.
5.1. pnfs_osd_data_map4 5.1. pnfs_osd_data_map4
/// struct pnfs_osd_data_map4 { /// struct pnfs_osd_data_map4 {
/// uint32_t odm_num_comps; /// uint32_t odm_num_comps;
/// length4 odm_stripe_unit; /// length4 odm_stripe_unit;
skipping to change at page 15, line 21 skipping to change at page 15, line 19
GETATTR commands to the metadata server. The client uses the file GETATTR commands to the metadata server. The client uses the file
size to decide if it should fill holes with zeros or return a short size to decide if it should fill holes with zeros or return a short
read. Striping patterns can cause cases where component objects are read. Striping patterns can cause cases where component objects are
shorter than other components because a hole happens to correspond to shorter than other components because a hole happens to correspond to
the last part of the component object. the last part of the component object.
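The hole-versus-short-read decision above can be sketched as follows. This is an illustrative sketch only: the function name finish_read and its parameters are hypothetical, and it assumes that any missing bytes lie at the tail of the data gathered from the component objects.

```python
def finish_read(data: bytes, offset: int, count: int, file_size: int) -> bytes:
    """Zero-fill holes or return a short read, based on the file size.

    data      -- bytes actually returned by the component objects
    offset    -- logical file offset of the read
    count     -- number of bytes the application asked for
    file_size -- file size as known from the metadata server
    """
    # Number of bytes the file logically contains in [offset, offset + count).
    logical = max(0, min(count, file_size - offset))
    if len(data) >= logical:
        # Enough data came back; trim anything past end of file.
        return data[:logical]
    # The component was shorter than the file: the gap is a hole, so
    # fill it with zeros up to the logical end of the request.
    return data + bytes(logical - len(data))
```

A read that ends before the file size is padded with zeros, while a read that crosses the end of the file comes back short.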
5.3. Data Mapping Schemes 5.3. Data Mapping Schemes
This section describes the different data mapping schemes in detail. This section describes the different data mapping schemes in detail.
The object layout always uses a "dense" layout as described in The object layout always uses a "dense" layout as described in
NFSv4.1 [6]. This means that the second stripe unit of the file NFSv4.1 [4]. This means that the second stripe unit of the file
starts at offset 0 of the second component, rather than at offset starts at offset 0 of the second component, rather than at offset
stripe_unit bytes. After a full stripe has been written, the next stripe_unit bytes. After a full stripe has been written, the next
stripe unit is appended to the first component object in the list stripe unit is appended to the first component object in the list
without any holes in the component objects. without any holes in the component objects.
5.3.1. Simple Striping 5.3.1. Simple Striping
The mapping from the logical offset within a file (L) to the The mapping from the logical offset within a file (L) to the
component object C and object-specific offset O is defined by the component object C and object-specific offset O is defined by the
following equations: following equations:
skipping to change at page 16, line 35 skipping to change at page 16, line 34
O = (0*4096)+(9000%4096) = 808 O = (0*4096)+(9000%4096) = 808
Offset 132000: Offset 132000:
N = 132000 / 16384 = 8 N = 132000 / 16384 = 8
C = (132000 % 16384) / 4096 = 0 (D0) C = (132000 % 16384) / 4096 = 0 (D0)
O = (8*4096) + (132000%4096) = 33696 O = (8*4096) + (132000%4096) = 33696
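The worked examples above can be reproduced with a short sketch. The function name map_simple is hypothetical; the stripe unit of 4096 bytes and the four components (W = 4) are inferred from the 16384-byte stripe used in the examples.

```python
def map_simple(L: int, stripe_unit: int, W: int):
    """Map logical file offset L to (N, C, O) for simple striping.

    N -- stripe number
    C -- component object index
    O -- byte offset within component object C
    """
    S = W * stripe_unit                    # bytes in one full stripe
    N = L // S                             # which stripe holds L
    C = (L % S) // stripe_unit             # which component in that stripe
    O = N * stripe_unit + L % stripe_unit  # dense offset in the component
    return N, C, O


print(map_simple(9000, 4096, 4))
print(map_simple(132000, 4096, 4))
```

Because the layout is dense, O grows by one stripe unit per full stripe, so the component objects contain no holes.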
5.3.2. Nested Striping 5.3.2. Nested Striping
The odm_group_width and odm_group_depth parameters allow a nested The odm_group_width and odm_group_depth parameters allow a nested
striping pattern. odm_group_width defines the width of a data stripe striping pattern. odm_group_width defines the width of a data stripe
and odm_group_depth defines how many stripes are written before and odm_group_depth defines how many stripes are written before
advancing to the next group of components in the list of component advancing to the next group of components in the list of component
objects for the file. The math used to map from a file offset to a objects for the file. The math used to map from a file offset to a
component object and offset within that object is shown below. The component object and offset within that object is shown below. The
computations map from the logical offset L to the component index C computations map from the logical offset L to the component index C
and offset relative O within that component object. and offset relative O within that component object.
L: logical offset into the file L: logical offset into the file
FW: total number of components FW: total number of components
skipping to change at page 19, line 42 skipping to change at page 19, line 45
implicit error caused by a client's failure to return a layout MUST implicit error caused by a client's failure to return a layout MUST
trigger recovery action by the server to prevent access to invalid trigger recovery action by the server to prevent access to invalid
data (see Section 7). It is the server's responsibility to only data (see Section 7). It is the server's responsibility to only
grant layout information to files that can be safely accessed, and to grant layout information to files that can be safely accessed, and to
deny access to files that are in an inconsistent state. deny access to files that are in an inconsistent state.
5.4.1. PNFS_OSD_RAID_0 5.4.1. PNFS_OSD_RAID_0
PNFS_OSD_RAID_0 means there is no parity data, so all bytes in the PNFS_OSD_RAID_0 means there is no parity data, so all bytes in the
component objects are data bytes located by the above equations for C component objects are data bytes located by the above equations for C
and O. If a component object is marked as PNFS_OSD_MISSING, an I/O and O. If a component object is marked as PNFS_OSD_MISSING, an I/O
error MUST be returned if this component is accessed. In this case, error MUST be returned if this component is accessed. In this case,
the generic NFS client layer MAY elect to retry this operation the generic NFS client layer MAY elect to retry this operation
against the pNFS server. against the pNFS server.
5.4.2. PNFS_OSD_RAID_4 5.4.2. PNFS_OSD_RAID_4
PNFS_OSD_RAID_4 means that the last component object, or the last in PNFS_OSD_RAID_4 means that the last component object, or the last in
each group (if odm_group_width is greater than zero), contains parity each group (if odm_group_width is greater than zero), contains parity
information computed over the rest of the stripe with an XOR information computed over the rest of the stripe with an XOR
operation. If a component object is unavailable, the client can read operation. If a component object is unavailable, the client can read
skipping to change at page 21, line 27 skipping to change at page 21, line 27
Cr: The rotated device index Cr: The rotated device index
(C is as computed in the above equations for RAID-4) (C is as computed in the above equations for RAID-4)
Cr = (W + C - (R * P)) % W Cr = (W + C - (R * P)) % W
Note: W is added above to avoid negative numbers in the modulo math. Note: W is added above to avoid negative numbers in the modulo math.
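The rotation formula above can be sketched as a one-line helper. The name rotated_index is hypothetical, and the sketch assumes R * P does not exceed W + C so that the sum stays non-negative. In Python the % operator already yields a non-negative result for a positive W, but in languages such as C the remainder takes the sign of the dividend, which is why W is added first.

```python
def rotated_index(C: int, R: int, P: int, W: int) -> int:
    """Return Cr, the rotated device index, for component index C.

    R -- rotation index of the current stripe (assumed given)
    P -- number of parity components
    W -- total number of components
    """
    # Adding W keeps the left operand non-negative before the modulo.
    return (W + C - (R * P)) % W
```

For example, with W = 4 and P = 1, component 0 in the stripe with R = 1 wraps around to device 3.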
5.4.4. PNFS_OSD_RAID_PQ 5.4.4. PNFS_OSD_RAID_PQ
PNFS_OSD_RAID_PQ is a double-parity scheme that uses the Reed-Solomon PNFS_OSD_RAID_PQ is a double-parity scheme that uses the Reed-Solomon
P+Q encoding scheme [19]. In this layout, the last two component P+Q encoding scheme [16]. In this layout, the last two component
objects hold the P and Q data, respectively. P is parity computed objects hold the P and Q data, respectively. P is parity computed
with XOR. The Q computation is described in detail by Anvin [20]. with XOR. The Q computation is described in detail by Anvin [17].
The same polynomial "x^8+x^4+x^3+x^2+1" and Galois field size of 2^8 The same polynomial "x^8+x^4+x^3+x^2+1" and Galois field size of 2^8
are used here. Clients may simply choose to read data through the are used here. Clients may simply choose to read data through the
metadata server if two or more components are missing or damaged. metadata server if two or more components are missing or damaged.
The equations given above for embedded parity can be used to map a The equations given above for embedded parity can be used to map a
file offset to the correct component object by setting the number of file offset to the correct component object by setting the number of
parity components (P) to 2 instead of 1 for RAID-5 and computing the parity components (P) to 2 instead of 1 for RAID-5 and computing the
Parity Cycle length as the Least Common Multiple [21] of Parity Cycle length as the Least Common Multiple [18] of
odm_group_width and P, divided by P, as described below. Note: This odm_group_width and P, divided by P, as described below. Note: This
algorithm can also be used for RAID-5 where P=1. algorithm can also be used for RAID-5 where P=1.
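As a minimal sketch of the parity cycle computation named in this paragraph, the least common multiple can be derived from the greatest common divisor. The function name parity_cycle is hypothetical, and W stands for odm_group_width.

```python
from math import gcd


def parity_cycle(W: int, P: int = 2) -> int:
    """Return the parity cycle length LCM(W, P) / P.

    W -- stripe width (odm_group_width)
    P -- number of parity components (2 for RAID-PQ, 1 for RAID-5)
    """
    lcm = W * P // gcd(W, P)  # least common multiple of W and P
    return lcm // P
```

With P = 1 the cycle length equals W, which matches the single-parity rotation; with P = 2 an even W gives a cycle of W / 2 and an odd W gives a cycle of W.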
P: number of parity devices P: number of parity devices
P = 2 P = 2
PC: Parity cycle: PC: Parity cycle:
PC = LCM(W, P) / P PC = LCM(W, P) / P
Q: The device index holding the Q component Q: The device index holding the Q component
skipping to change at page 22, line 18 skipping to change at page 22, line 18
serialization of updates to ensure correct operation. Otherwise, if serialization of updates to ensure correct operation. Otherwise, if
two clients simultaneously write to the same logical range of an two clients simultaneously write to the same logical range of an
object, the result could include different data in the same ranges of object, the result could include different data in the same ranges of
mirrored tuples, or corrupt parity information. It is the mirrored tuples, or corrupt parity information. It is the
responsibility of the metadata server to enforce serialization responsibility of the metadata server to enforce serialization
requirements. Serialization MUST occur at the RAID stripe boundary requirements. Serialization MUST occur at the RAID stripe boundary
for write operations to avoid corrupting parity by concurrent updates for write operations to avoid corrupting parity by concurrent updates
to the same stripe. Mirrors do not have explicit stripe boundaries, to the same stripe. Mirrors do not have explicit stripe boundaries,
so it is sufficient to serialize writes to the same byte ranges. so it is sufficient to serialize writes to the same byte ranges.
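One way a metadata server might derive the range to serialize for a parity layout is to widen the write to whole stripes. This is a sketch under stated assumptions, not a prescribed method: the name stripe_aligned_range is hypothetical, and data_width is assumed to be the number of data (non-parity) components in a stripe.

```python
def stripe_aligned_range(offset: int, length: int,
                         stripe_unit: int, data_width: int):
    """Widen [offset, offset + length) to RAID stripe boundaries.

    Returns (start, size) of the smallest stripe-aligned range that
    covers the write, expressed in logical file bytes.
    """
    stripe = stripe_unit * data_width       # file bytes per stripe
    start = (offset // stripe) * stripe     # round down to a boundary
    end = -(-(offset + length) // stripe) * stripe  # round up
    return start, end - start
```

Two writes whose widened ranges overlap touch the same parity and would need to be serialized; for mirrors, comparing the unwidened byte ranges is sufficient.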
Many alternative encoding schemes exist for P>=2 [22]. These involve Many alternative encoding schemes exist for P>=2 [19]. These involve
P or Q equations different than the Reed-Solomon encoding used in P or Q equations different than the Reed-Solomon encoding used in
PNFS_OSD_RAID_PQ. Thus, if one of these schemes is to be used in the PNFS_OSD_RAID_PQ. Thus, if one of these schemes is to be used in the
future, a distinct value must be added to pnfs_osd_raid_algorithm4 future, a distinct value must be added to pnfs_osd_raid_algorithm4
for it. for it.
6. Object-Based Layout Update 6. Object-Based Layout Update
layoutupdate4 is used in the LAYOUTCOMMIT operation to convey updates layoutupdate4 is used in the LAYOUTCOMMIT operation to convey updates
to the layout and additional information to the metadata server. It to the layout and additional information to the metadata server. It
is defined in the NFSv4.1 [6] as follows: is defined in the NFSv4.1 [4] as follows:
struct layoutupdate4 { struct layoutupdate4 {
layouttype4 lou_type; layouttype4 lou_type;
opaque lou_body<>; opaque lou_body<>;
}; };
The layoutupdate4 type is an opaque value at the generic pNFS client The layoutupdate4 type is an opaque value at the generic pNFS client
level. If the lou_type layout type is LAYOUT4_OSD2_OBJECTS, then the level. If the lou_type layout type is LAYOUT4_OSD2_OBJECTS, then the
lou_body opaque value is defined by the pnfs_osd_layoutupdate4 type. lou_body opaque value is defined by the pnfs_osd_layoutupdate4 type.
skipping to change at page 24, line 39 skipping to change at page 24, line 40
associated error information. The second step is to request a new associated error information. The second step is to request a new
layout using LAYOUTGET and then retry the I/O operation with the new layout using LAYOUTGET and then retry the I/O operation with the new
layout. Finally, if the error persists, the client may choose to layout. Finally, if the error persists, the client may choose to
retry the I/O operation using regular NFS READ or WRITE operations retry the I/O operation using regular NFS READ or WRITE operations
via the metadata server. via the metadata server.
8. Object-Based Layout Return 8. Object-Based Layout Return
layoutreturn_file4 is used in the LAYOUTRETURN operation to convey layoutreturn_file4 is used in the LAYOUTRETURN operation to convey
layout-type specific information to the server. It is defined in the layout-type specific information to the server. It is defined in the
NFSv4.1 [6] as follows: NFSv4.1 [4] as follows:
struct layoutreturn_file4 { struct layoutreturn_file4 {
offset4 lrf_offset; offset4 lrf_offset;
length4 lrf_length; length4 lrf_length;
stateid4 lrf_stateid; stateid4 lrf_stateid;
/* layouttype4 specific data */ /* layouttype4 specific data */
opaque lrf_body<>; opaque lrf_body<>;
}; };
union layoutreturn4 switch(layoutreturn_type4 lr_returntype) { union layoutreturn4 switch(layoutreturn_type4 lr_returntype) {
skipping to change at page 27, line 32 skipping to change at page 27, line 33
When OSD I/O operations failed, "olr_ioerr_report<>" is used to When OSD I/O operations failed, "olr_ioerr_report<>" is used to
report these errors to the metadata server as an array of elements of report these errors to the metadata server as an array of elements of
type pnfs_osd_ioerr4. Each element in the array represents an error type pnfs_osd_ioerr4. Each element in the array represents an error
that occurred on the object specified by oer_component. If no errors that occurred on the object specified by oer_component. If no errors
are to be reported, the size of the olr_ioerr_report<> array is set are to be reported, the size of the olr_ioerr_report<> array is set
to zero. to zero.
9. Object-Based Creation Layout Hint 9. Object-Based Creation Layout Hint
The layouthint4 type is defined in the NFSv4.1 [6] as follows: The layouthint4 type is defined in the NFSv4.1 [4] as follows:
struct layouthint4 { struct layouthint4 {
layouttype4 loh_type; layouttype4 loh_type;
opaque loh_body<>; opaque loh_body<>;
}; };
The layouthint4 structure is used by the client to pass a hint about The layouthint4 structure is used by the client to pass a hint about
the type of layout it would like created for a particular file. If the type of layout it would like created for a particular file. If
the loh_type layout type is LAYOUT4_OSD2_OBJECTS, then the loh_body the loh_type layout type is LAYOUT4_OSD2_OBJECTS, then the loh_body
opaque value is defined by the pnfs_osd_layouthint4 type. opaque value is defined by the pnfs_osd_layouthint4 type.
skipping to change at page 30, line 36 skipping to change at page 30, line 36
hold valid write layouts for the same stripes. An outstanding hold valid write layouts for the same stripes. An outstanding
READ/WRITE (RW) layout should be recalled when a conflicting READ/WRITE (RW) layout should be recalled when a conflicting
LAYOUTGET is received from a different client for LAYOUTIOMODE4_RW LAYOUTGET is received from a different client for LAYOUTIOMODE4_RW
and for a byte range overlapping with the outstanding layout and for a byte range overlapping with the outstanding layout
segment. segment.
11.1. CB_RECALL_ANY 11.1. CB_RECALL_ANY
The metadata server can use the CB_RECALL_ANY callback operation to The metadata server can use the CB_RECALL_ANY callback operation to
notify the client to return some or all of its layouts. The NFSv4.1 notify the client to return some or all of its layouts. The NFSv4.1
[6] defines the following types: [4] defines the following types:
const RCA4_TYPE_MASK_OBJ_LAYOUT_MIN = 8; const RCA4_TYPE_MASK_OBJ_LAYOUT_MIN = 8;
const RCA4_TYPE_MASK_OBJ_LAYOUT_MAX = 9; const RCA4_TYPE_MASK_OBJ_LAYOUT_MAX = 9;
struct CB_RECALL_ANY4args { struct CB_RECALL_ANY4args {
uint32_t craa_objects_to_keep; uint32_t craa_objects_to_keep;
bitmap4 craa_type_mask; bitmap4 craa_type_mask;
}; };
Typically, CB_RECALL_ANY will be used to recall client state when the Typically, CB_RECALL_ANY will be used to recall client state when the
skipping to change at page 31, line 24 skipping to change at page 31, line 23
The PNFS_OSD_RCA4_TYPE_MASK_READ flag notifies the client to return The PNFS_OSD_RCA4_TYPE_MASK_READ flag notifies the client to return
layouts of iomode LAYOUTIOMODE4_READ. Similarly, the layouts of iomode LAYOUTIOMODE4_READ. Similarly, the
PNFS_OSD_RCA4_TYPE_MASK_RW flag notifies the client to return layouts PNFS_OSD_RCA4_TYPE_MASK_RW flag notifies the client to return layouts
of iomode LAYOUTIOMODE4_RW. When both mask flags are set, the client of iomode LAYOUTIOMODE4_RW. When both mask flags are set, the client
is notified to return layouts of either iomode. is notified to return layouts of either iomode.
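A client's handling of the two mask flags can be sketched as below. The flag values 8 and 9 are an assumption taken from the RCA4_TYPE_MASK_OBJ_LAYOUT_MIN and RCA4_TYPE_MASK_OBJ_LAYOUT_MAX constants shown earlier in this section, and they are treated as bit positions in a bitmap4 modeled as a list of 32-bit words. The helper names are hypothetical.

```python
# Assumed bit positions, matching the OBJ_LAYOUT_MIN/MAX range (8..9).
PNFS_OSD_RCA4_TYPE_MASK_READ = 8
PNFS_OSD_RCA4_TYPE_MASK_RW = 9


def bit_set(bitmap, bit: int) -> bool:
    """Test one bit of a bitmap4 held as a list of 32-bit words."""
    word, shift = divmod(bit, 32)
    return word < len(bitmap) and bool((bitmap[word] >> shift) & 1)


def iomodes_to_return(craa_type_mask):
    """List the layout iomodes the client is asked to return."""
    modes = []
    if bit_set(craa_type_mask, PNFS_OSD_RCA4_TYPE_MASK_READ):
        modes.append("LAYOUTIOMODE4_READ")
    if bit_set(craa_type_mask, PNFS_OSD_RCA4_TYPE_MASK_RW):
        modes.append("LAYOUTIOMODE4_RW")
    return modes
```

When both bits are set the function reports both iomodes, which corresponds to returning layouts of either iomode.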
12. Client Fencing 12. Client Fencing
In cases where clients are uncommunicative and their lease has In cases where clients are uncommunicative and their lease has
expired or when clients fail to return recalled layouts within a expired or when clients fail to return recalled layouts within a
lease period at the least (see "Recalling a Layout"[6]), the server lease period at the least (see "Recalling a Layout"[4]), the server
MAY revoke client layouts and/or device address mappings and reassign MAY revoke client layouts and/or device address mappings and reassign
these resources to other clients. To avoid data corruption, the these resources to other clients. To avoid data corruption, the
metadata server MUST fence off the revoked clients from the metadata server MUST fence off the revoked clients from the
respective objects as described in Section 13.4. respective objects as described in Section 13.4.
13. Security Considerations 13. Security Considerations
The pNFS extension partitions the NFSv4 file system protocol into two The pNFS extension partitions the NFSv4 file system protocol into two
parts, the control path and the data path (storage protocol). The parts, the control path and the data path (storage protocol). The
control path contains all the new operations described by this control path contains all the new operations described by this
skipping to change at page 33, line 52 skipping to change at page 33, line 49
LAYOUTGET returns a CapKey and a Cap, which, together with the OSD LAYOUTGET returns a CapKey and a Cap, which, together with the OSD
SystemID, are also called a credential. It is a capability and a SystemID, are also called a credential. It is a capability and a
signature over that capability and the SystemID. The OSD Standard signature over that capability and the SystemID. The OSD Standard
refers to the CapKey as the "Credential integrity check value" and to refers to the CapKey as the "Credential integrity check value" and to
the ReqMAC as the "Request integrity check value". the ReqMAC as the "Request integrity check value".
CapKey = MAC<SecretKey>(Cap, SystemID) CapKey = MAC<SecretKey>(Cap, SystemID)
Credential = {Cap, SystemID, CapKey} Credential = {Cap, SystemID, CapKey}
The client uses CapKey to sign all the requests it issues for that The client uses CapKey to sign all the requests it issues for that
object using the respective Cap. In other words, the Cap appears in object using the respective Cap. In other words, the Cap appears in
the request to the storage device, and that request is signed with the request to the storage device, and that request is signed with
the CapKey as follows: the CapKey as follows:
ReqMAC = MAC<CapKey>(Req, ReqNonce) ReqMAC = MAC<CapKey>(Req, ReqNonce)
Request = {Cap, Req, ReqNonce, ReqMAC} Request = {Cap, Req, ReqNonce, ReqMAC}
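The two signing equations can be illustrated with a small sketch. Several details here are assumptions made for illustration: HMAC-SHA1 stands in for the MAC function, the MAC inputs are joined by plain concatenation, and the credential and request are modeled as dictionaries. The real encodings are defined by the OSD standard, and all function names are hypothetical.

```python
import hashlib
import hmac


def mac(key: bytes, *fields: bytes) -> bytes:
    """MAC<key>(fields...), assuming HMAC-SHA1 over concatenated fields."""
    return hmac.new(key, b"".join(fields), hashlib.sha1).digest()


def make_credential(secret_key: bytes, cap: bytes, system_id: bytes) -> dict:
    """Metadata server side: CapKey = MAC<SecretKey>(Cap, SystemID)."""
    return {"Cap": cap, "SystemID": system_id,
            "CapKey": mac(secret_key, cap, system_id)}


def sign_request(cap_key: bytes, cap: bytes, req: bytes, nonce: bytes) -> dict:
    """Client side: ReqMAC = MAC<CapKey>(Req, ReqNonce)."""
    return {"Cap": cap, "Req": req, "ReqNonce": nonce,
            "ReqMAC": mac(cap_key, req, nonce)}


def osd_verify(secret_key: bytes, system_id: bytes, request: dict) -> bool:
    """OSD side: recompute CapKey and ReqMAC, then compare."""
    local_key = mac(secret_key, request["Cap"], system_id)
    local_mac = mac(local_key, request["Req"], request["ReqNonce"])
    return hmac.compare_digest(local_mac, request["ReqMAC"])
```

The client never sees the SecretKey; it holds only the CapKey derived from it, so it can sign requests only for the capability it was granted.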
The following is sent to the OSD: {Cap, Req, ReqNonce, ReqMAC}. The The following is sent to the OSD: {Cap, Req, ReqNonce, ReqMAC}. The
OSD uses the SecretKey it shares with the metadata server to compare OSD uses the SecretKey it shares with the metadata server to compare
the ReqMAC the client sent with a locally computed value: the ReqMAC the client sent with a locally computed value:
skipping to change at page 35, line 30 skipping to change at page 35, line 28
the pNFS client will obtain a separate layout for each user accessing the pNFS client will obtain a separate layout for each user accessing
a shared object. The client SHOULD use OPEN and ACCESS calls to a shared object. The client SHOULD use OPEN and ACCESS calls to
check user permissions when performing I/O so that the server's check user permissions when performing I/O so that the server's
access control policies are correctly enforced. The result of the access control policies are correctly enforced. The result of the
ACCESS operation may be cached while the client holds a valid layout ACCESS operation may be cached while the client holds a valid layout
as the server is expected to recall layouts when the file's access as the server is expected to recall layouts when the file's access
permissions or ACL change. permissions or ACL change.
14. IANA Considerations 14. IANA Considerations
As described in NFSv4.1 [6], new layout type numbers have been As described in NFSv4.1 [4], new layout type numbers have been
assigned by IANA. This document defines the protocol associated with assigned by IANA. This document defines the protocol associated with
the existing layout type number, LAYOUT4_OSD2_OBJECTS, and it the existing layout type number, LAYOUT4_OSD2_OBJECTS, and it
requires no further actions for IANA. requires no further actions for IANA.
15. References 15. References
15.1. Normative References 15.1. Normative References
[1] Weber, R., "Information Technology - SCSI Object-Based Storage [1] Weber, R., "Information Technology - SCSI Object-Based
Device Commands (OSD)", ANSI INCITS 400-2004, December 2004. Storage Device Commands (OSD)", ANSI INCITS 400-2004,
December 2004.
[2] Bradner, S., "Key words for use in RFCs to Indicate Requirement [2] Bradner, S., "Key words for use in RFCs to Indicate
Levels", BCP 14, RFC 2119, March 1997. Requirement Levels", BCP 14, RFC 2119, March 1997.
[3] Eisler, M., "XDR: External Data Representation Standard", [3] IETF Trust, "Legal Provisions Relating to IETF Documents",
STD 67, RFC 4506, May 2006. November 2008, <http://trustee.ietf.org/docs/
IETF-Trust-License-Policy.pdf>.
[4] Shepler, S., Ed., Eisler, M., Ed., and D. Noveck, Ed., "Network [4] Shepler, S., Ed., Eisler, M., Ed., and D. Noveck, Ed.,
File System (NFS) Version 4 Minor Version 1 External Data "Network File System (NFS) Version 4 Minor Version 1
Representation Standard (XDR) Description", RFC 5662, Protocol", RFC 5661, January 2010.
January 2010.
[5] IETF Trust, "Legal Provisions Relating to IETF Documents", [5] Shepler, S., Ed., Eisler, M., Ed., and D. Noveck, Ed.,
November 2008, "Network File System (NFS) Version 4 Minor Version 1
<http://trustee.ietf.org/docs/IETF-Trust-License-Policy.pdf>. External Data Representation Standard (XDR) Description",
RFC 5662, January 2010.
[6] Shepler, S., Ed., Eisler, M., Ed., and D. Noveck, Ed., "Network [6] Eisler, M., "XDR: External Data Representation Standard",
File System (NFS) Version 4 Minor Version 1 Protocol", STD 67, RFC 4506, May 2006.
RFC 5661, January 2010.
[7] Linn, J., "Generic Security Service Application Program [7] Linn, J., "Generic Security Service Application Program
Interface Version 2, Update 1", RFC 2743, January 2000. Interface Version 2, Update 1", RFC 2743, January 2000.
[8] Satran, J., Meth, K., Sapuntzakis, C., Chadalapaka, M., and E. [8] Satran, J., Meth, K., Sapuntzakis, C., Chadalapaka, M.,
Zeidner, "Internet Small Computer Systems Interface (iSCSI)", and E. Zeidner, "Internet Small Computer Systems Interface
RFC 3720, April 2004. (iSCSI)", RFC 3720, April 2004.
[9] Weber, R., "SCSI Primary Commands - 3 (SPC-3)", ANSI [9] Krueger, M., Chadalapaka, M., and R. Elliott, "T11 Network
INCITS 408-2005, October 2005. Address Authority (NAA) Naming Format for iSCSI Node
Names", RFC 3980, February 2005.
[10] Krueger, M., Chadalapaka, M., and R. Elliott, "T11 Network [10] Weber, R., "SCSI Primary Commands - 3 (SPC-3)", ANSI
Address Authority (NAA) Naming Format for iSCSI Node Names", INCITS 408-2005, October 2005.
RFC 3980, February 2005.
[11] IEEE, "Guidelines for 64-bit Global Identifier (EUI-64) [11] Weber, R., "SCSI Architecture Model - 3 (SAM-3)", ANSI
Registration Authority", INCITS 402-2005, February 2005.
<http://standards.ieee.org/regauth/oui/tutorials/EUI64.html>.
[12] Tseng, J., Gibbons, K., Travostino, F., Du Laney, C., and J. [12] Tseng, J., Gibbons, K., Travostino, F., Du Laney, C., and
Souza, "Internet Storage Name Service (iSNS)", RFC 4171, J. Souza, "Internet Storage Name Service (iSNS)", RFC
September 2005. 4171, September 2005.
[13] Weber, R., "SCSI Architecture Model - 3 (SAM-3)", ANSI [13] IEEE, "Guidelines for 64-bit Global Identifier (EUI-64)
INCITS 402-2005, February 2005. Registration Authority", <http://standards.ieee.org/
regauth/oui/tutorials/EUI64.html>.
15.2. Informative References 15.2. Informative References
[14] Weber, R., "SCSI Object-Based Storage Device Commands -2 [14] Weber, R., "SCSI Object-Based Storage Device Commands -2
(OSD-2)", January 2009, (OSD-2)", January 2009,
<http://www.t10.org/cgi-bin/ac.pl?t=f&f=osd2r05a.pdf>. <http://www.t10.org/cgi-bin/ac.pl?t=f&f=osd2r05a.pdf>.
[15] Kent, S. and K. Seo, "Security Architecture for the Internet [15] Kent, S. and K. Seo, "Security Architecture for the
Protocol", RFC 4301, December 2005. Internet Protocol", RFC 4301, December 2005.
[16] T10 1415-D, "SCSI RDMA Protocol (SRP)", ANSI INCITS 365-2002, [16] MacWilliams, F. and N. Sloane, "The Theory of Error-
December 2002. Correcting Codes, Part I", 1977.
[17] T11 1619-D, "Fibre Channel Framing and Signaling - 2 [17] Anvin, H., "The Mathematics of RAID-6", May 2009,
(FC-FS-2)", ANSI INCITS 424-2007, February 2007. <http://kernel.org/pub/linux/kernel/people/hpa/raid6.pdf>.
[18] T10 1601-D, "Serial Attached SCSI - 1.1 (SAS-1.1)", ANSI [18] The free encyclopedia, Wikipedia., "Least common
INCITS 417-2006, June 2006. multiple", April 2011,
<http://en.wikipedia.org/wiki/Least_common_multiple>.
[19] MacWilliams, F. and N. Sloane, "The Theory of Error-Correcting [19] Plank, James S., and Luo, Jianqiang and Schuman, Catherine
Codes, Part I", 1977. D. and Xu, Lihao and Wilcox-O'Hearn, Zooko, "A
Performance Evaluation and Examination of Open-source
Erasure Coding Libraries for Storage", 2007.
[20] Anvin, H., "The Mathematics of RAID-6", May 2009, [20] T10 1415-D, "SCSI RDMA Protocol (SRP)", ANSI INCITS
<http://kernel.org/pub/linux/kernel/people/hpa/raid6.pdf>. 365-2002, December 2002.
[21] The free encyclopedia, Wikipedia., "Least common multiple", [21] T11 1619-D, "Fibre Channel Framing and Signaling - 2 (FC-
April 2011, FS-2)", ANSI INCITS 424-2007, February 2007.
<http://en.wikipedia.org/wiki/Least_common_multiple>.
[22] Plank, James S., and Luo, Jianqiang and Schuman, Catherine D. [22] T10 1601-D, "Serial Attached SCSI - 1.1 (SAS-1.1)", ANSI
and Xu, Lihao and Wilcox-O'Hearn, Zooko, "A Performance INCITS 417-2006, June 2006.
Evaluation and Examination of Open-source Erasure Coding
Libraries for Storage", 2007.
Appendix A. Acknowledgments Appendix A. Acknowledgments
Todd Pisek was a co-editor of the initial versions of this document. Todd Pisek was a co-editor of the initial versions of this document.
Daniel E. Messinger, Pete Wyckoff, Mike Eisler, Sean P. Turner, Brian Daniel E. Messinger, Pete Wyckoff, Mike Eisler, Sean P. Turner, Brian
E. Carpenter, Jari Arkko, David Black, and Jason Glasgow reviewed and E. Carpenter, Jari Arkko, David Black, and Jason Glasgow reviewed and
commented on this document. commented on this document.
Authors' Addresses Authors' Addresses
Benny Halevy Benny Halevy
Primary Data Primary Data
Email: bhalevy@primarydata.com Email: bhalevy@primarydata.com
URI: http://www.primarydata.com/ URI: http://www.primarydata.com/
Boaz Harrosh Boaz Harrosh
Panasas, Inc. Panasas, Inc.
1501 Reedsdale St. Suite 400 1501 Reedsdale St. Suite 400
Pittsburgh, PA 15233 Pittsburgh, PA 15233
USA USA
Phone: +1-412-323-3500 Phone: +1-412-323-3500
Email: bharrosh@panasas.com Email: bharrosh@panasas.com
URI: http://www.panasas.com/ URI: http://www.panasas.com/
Brent Welch Brent Welch
Panasas, Inc. Panasas, Inc.
969 W. Maude Ave 969 W. Maude Ave
Sunnyvale, CA 94095 Sunnyvale, CA 94095
USA USA
Phone: +1-408-215-6715 Phone: +1-408-215-6715
Email: welch@acm.org Email: welch@acm.org
URI: http://www.panasas.com/ URI: http://www.panasas.com/
Brian Mueller
Panasas, Inc.
Email: bmueller@panasas.com
URI: http://www.panasas.com/
 End of changes. 64 change blocks. 
154 lines changed or deleted 157 lines changed or added
