[Docs] [txt|pdf] [Tracker] [Email] [Diff1] [Diff2] [Nits]

Versions: 00 01 02 03 04 05 06 07 08 09 10 11 12 13 14

Network Working Group                                           A. Boyko
Internet-Draft                                       Library of Congress
Expires: December 1, 2008                                       J. Kunze
                                              California Digital Library
                                                               L. Madden
                                                              J. Littman
                                                     Library of Congress
                                                            May 30, 2008

                 The BagIt File Package Format (V0.94)

Status of this Memo

   By submitting this Internet-Draft, each author represents that any
   applicable patent or other IPR claims of which he or she is aware
   have been or will be disclosed, and any of which he or she becomes
   aware will be disclosed, in accordance with Section 6 of BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as Internet-

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at

   The list of Internet-Draft Shadow Directories can be accessed at

   This Internet-Draft will expire on December 1, 2008.

Copyright Notice

   Copyright (C) The IETF Trust (2008).

Boyko, et al.           Expires December 1, 2008                [Page 1]

Internet-Draft                    BagIt                         May 2008


   This document specifies BagIt, a hierarchical file package format for
   the exchange of generalized digital content.  A "bag" has just enough
   structure to safely enclose a brief "tag" and a payload but does not
   require any knowledge of the payload's internal semantics.  This
   BagIt format should be suitable for disk-based or network-based file
   package transfer.  One important use case is the possibility of
   eventual safe return of a received bag.  Tag information consists of
   a small number of top-level reserved file names, checksums for
   transfer validation, and optional small metadata blocks.

Boyko, et al.           Expires December 1, 2008                [Page 2]

Internet-Draft                    BagIt                         May 2008

1.  Introduction

   BagIt is a hierarchical file package format for the exchange of
   generalized digital content.  A "bag" has just enough structure to
   safely enclose a brief "tag" and a payload but does not require any
   knowledge of the payload's internal semantics.  This BagIt format
   should be suitable for disk-based or network-based file package
   transfer.  Use cases include long-term storage and the possibility of
   eventual safe return of a received bag.  Tag information consists of
   a small number of top-level reserved file names, checksums for
   transfer validation, and optional small metadata blocks.  The name
   BagIt is inspired by the "enclose and deposit" method [ENCDEP],
   sometimes referred to as "bag it and tag it".

   In this document the word "directory" is used interchangeably with
   the word "folder" and all examples conform to Unix-based filesystem
   conventions which should translate easily to Windows conventions
   after substituting the path separator ('\' instead of '/').  The
   BagIt format itself places no limitations on file and path lengths,
   so implementors thinking about maximal interoperation may wish to
   consider the issues listed in the Interoperability section of this

Boyko, et al.           Expires December 1, 2008                [Page 3]

Internet-Draft                    BagIt                         May 2008

2.  BagIt package layout

   A "bag" consists of a base directory containing a set of top-level
   files comprising the "tag" and a sub-directory named "data/" that
   holds the payload.  The base directory may have any name and the
   "data/" directory may contain an arbitrary file hierarchy.

           |    manifest-<algorithm>.txt
           |    bagit.txt
           |    [optional additional tag files]
           \--- data/
                 |    [optional file hierarchy]

   The "tag" consists of one or more files named "manifest-
   _algorithm_.txt", a file named "bagit.txt", and zero or more
   additional files.  In top-level text files with ".txt" extension,
   each line should be terminated by a newline (LF) or carriage return
   plus newline (CRLF); in practice cautious programmers will also
   accept a carriage return by itself (CR) as a line terminator.  In all
   such tag files, text is assumed to be Unicode encoded as UTF-8

2.1.  Declaration that this is a bag:  bagit.txt

   The "bagit.txt" file should consist of exactly two lines,

   BagIt-Version: M.N
   Tag-File-Character-Encoding: UTF-8

   where M.N identifies the BagIt major (M) and minor (N) version
   numbers, and UTF-8 identifies the character set encoding of tag

2.2.  File manifest: manifest-<algorithm>.txt

   One or more manifest files must be present.  A manifest is a top-
   level file with a name of the form manifest-_algorithm_.txt, where
   _algorithm_ is a string specifying a cryptographic checksum
   algorithm, such as


   Implementors of tools that create and validate bags are strongly
   encouraged to support at least two widely implemented checksum
   algorithms: "md5" [RFC1321] and "sha1" [RFC3174].  When using other
   algorithms, the name of the algorithm should be normalized for use in

Boyko, et al.           Expires December 1, 2008                [Page 4]

Internet-Draft                    BagIt                         May 2008

   the manifest's filename, by lowercasing the common name of the
   algorithm, and removing all non-alphanumeric characters.

   A manifest contains a complete list of payload files that must be
   present in a fully constituted bag.  Each line of a file manifest-
   _algorithm_.txt has the form


   where FILENAME is the pathname of a payload file relative to the base
   directory and CHECKSUM is a hex-encoded checksum calculated according
   to _algorithm_ over the file's contents.  As a result, every payload
   FILENAME listed begins "data/...".  Any tag (top-level) FILENAME may
   optionally appear in a manifest.  One or more linear whitespace
   characters (spaces or tabs) separate the CHECKSUM and FILENAME.

Boyko, et al.           Expires December 1, 2008                [Page 5]

Internet-Draft                    BagIt                         May 2008

3.  Example of a basic bag

   Here's a very simple bag containing an image and a companion OCR
   file.  Lines of file content are shown in parentheses beneath the
   file name.

   |   manifest-md5.txt
   |    (49afbd86a1ca9f34b677a3f09655eae9 data/27613-h/images/q172.png)
   |    (408ad21d50cef31da4df6d9ed81b01a7 data/27613-h/images/q172.txt)
   |   bagit.txt
   |    (BagIt-version: 0.9                                           )
   |    (Tag-File-Character-Encoding: UTF-8                           )
   \--- data/
        |   27613-h/images/q172.png
        |    (... image bytes ...                                     )
        |   27613-h/images/q172.txt
        |    (... OCR text ...                                        )

Boyko, et al.           Expires December 1, 2008                [Page 6]

Internet-Draft                    BagIt                         May 2008

4.  Completing a bag: fetch.txt

   For reasons of efficiency, a bag may be sent with a list of files to
   be fetched and added to the payload before it can meaningfully be
   checked for completeness.  An optional top-level file named
   "fetch.txt", if present, contains such a list.  Each line of
   "fetch.txt" has the form


   where URL identifies the file to be fetched, LENGTH is the number of
   octets in the file (or "-", to leave it unspecified), and FILENAME
   identifies the corresponding payload file.  One or more linear
   whitespace characters (spaces or tabs) separate these three values,
   and any such characters in the URL must be percent-encoded [RFC3986].

   Because "fetch.txt" lists files that are absent from a sent bag,
   receivers that are storing completed bags will want some way to
   record that the bag no longer needs completing, such as renaming this
   file (e.g., to "fetch-orig.txt") or changing a database flag; if bag
   return is supported, the "fetch.txt" file will need to be modified as
   appropriate to support the return transmission method.  Receipt of a
   bag is not final until all absent files are fetched.  The receiver of
   a bag with a "fetch.txt" tag file is expected promptly to complete
   the bag by fetching all URL-identified components as the sender is
   not bound to make the absent components available indefinitely.

   The "fetch.txt" file essentially allows a bag to be transmitted with
   "holes" in it, which can be practical for several reasons.  For
   example, it obviates the need for the sender to stage a large
   serialized copy of the content until the bag is transferred to the
   receiver.  Also, this method allows a sender to construct a bag from
   components that are either a subset of logically related components
   (e.g., the localized logical object could be much larger than what is
   intended for export) or assembled from logically distributed sources
   (e.g., the object components for export are not stored locally under
   one filesystem tree).

Boyko, et al.           Expires December 1, 2008                [Page 7]

Internet-Draft                    BagIt                         May 2008

5.  Checksum of a tag file: *.<algorithm>

   Sometimes it is desirable to record a checksum for a tag file without
   listing it in a manifest, but using instead an accompanying _tag
   checksum file_.  Applications include recording a checksum for the
   file manifest itself, or for a tag file added after the manifest was
   received.  The name of a tag checksum file for tag file _tfname_ has
   the form _tfname_._algorithm_.  For example, a tag checksum file
   using MD5 over "manifest-sha1.txt" would have the name


   A tag checksum file contains a single line having the same form
   (CHECKSUM FILENAME) and semantics as a file manifest.  In essence, it
   is a one-line manifest listing the hex-encoded checksum calculated
   according to _algorithm_ over the contents of _tfname_.

Boyko, et al.           Expires December 1, 2008                [Page 8]

Internet-Draft                    BagIt                         May 2008

6.  Valid bags and complete bags

   A bag is considered _valid_ if it is _complete_ and if each CHECKSUM
   in every manifest can be verified against the contents of its
   corresponding FILENAME.

   A bag is considered _complete_ if every manifest covers the same set
   of files and every file in the payload is listed in every manifest.
   This means that a bag is complete (a) if the set of files listed in
   any one manifest is identical to the set of files listed in every
   manifest, (b) if every payload and tag file listed in every manifest
   is present, and (c) if every file present in the payload is listed in
   every manifest.  Hence tag files do not need to be listed in the
   manifest(s), but in a complete bag any tag files appearing in one
   manifest must appear in all manifests.

Boyko, et al.           Expires December 1, 2008                [Page 9]

Internet-Draft                    BagIt                         May 2008

7.  Other BagIt metadata: package-info.txt

   Any other tag files are considered to be package information separate
   from the payload content.  The "data/" directory is the custodial
   focus of a bag, and the top-level files comprising the tag are
   intended to facilitate and document the transfer.  The tag could also
   be used to help in returning the bag to its sender at some point in
   the future.

   Tag information is optional.  If present, tag information at a
   minimum consists of a package-info.txt file.  This is a text file
   intended primarily for human readability using email-style headers
   [RFC2822].  It is recommended that lines not exceed 79 characters in
   length.  As mentioned earlier, text is assumed to be Unicode encoded
   as UTF-8.

   The package-info.txt file contains metadata elements describing the
   overall package.  It looks like this.

    Source-Organization: Spengler University
    Organization-Address: 1400 Elm St., Cupertino, California, 95014
    Contact-Name: Edna Janssen
    Contact-Phone: +1 408-555-1212
    Contact-Email: ej@spengler.edu
    External-Description: Uncompressed greyscale TIFF images from the
         Yoshimuri papers colle...
    Delivery-Date: 2008-01-15
    External-Identifier: spengler_yoshimuri_001
    Package-Size: 260 GB
    Bag-Group-Identifier: spengler_yoshimuri
    Bag-Count: 1 of 15
    Internal-Sender-Identifier: /storage/images/yoshimuri
    Internal-Sender-Description: Uncompressed greyscale TIFFs created from
         microfilm and are...

   All elements are provided as clues to ease handling on the sender and
   receiver ends.  No particular relationship between the sender
   organization and the payload content is assumed; for example, the
   sender may be a content aggregator, redistributor, collector,
   curator, or producer.

   Reserved element names are case-insensitive and defined as follows.

   Source-Organization  Organization transferring the content.

Boyko, et al.           Expires December 1, 2008               [Page 10]

Internet-Draft                    BagIt                         May 2008

   Organization-Address  Mailing address of the organization.

   Contact-Name  Person at the source organization who is responsible
      for the content transfer.

   Contact-Phone  International format telephone number of person or
      position responsible.

   Contact-Email  Fully qualified email address of person or position

   External-Description  A brief explanation of the contents and

   Delivery-Date  Date (YYYY-MM-DD) that the content is being

   External-Identifier  A sender-supplied identifier for the package.
      This identifier must be unique across the sender's content, and if
      recognizable as belonging to a globally unique scheme, the
      receiver should make an effort to honor reference to it.

   Package-Size  Size or approximate size of the package being
      transferred, followed by an abbreviation such as MB (megabytes),
      GB, or TB; for example, 42600 MB, 42.6 GB, or .043 TB.

   Bag-Group-Identifier  (optional) A sender-supplied identifier for the
      set, if any, of bags to which it logically belongs.  This
      identifier must be unique across the sender's content, and if
      recognizable as belonging to a globally unique scheme, the
      receiver should make an effort to honor reference to it.

   Bag-Count  (optional) Two numbers separated by "of", in particular,
      "N of T", where T is the total number of bags in a group of bags
      and N is the ordinal number within the group; if T is not known,
      specify it as "?" (question mark).  Examples: 1 of 2, 4 of 4, 3 of
      ?, 89 of 145.

   Internal-Sender-Identifier  (optional) An alternate sender-specific
      identifier for the content and/or package.  This value may be
      useful to senders who may retrieve the content in the future.  For
      instance, it might contain values that are relevant to the re-use
      of the content at the sender's organization.

   Internal-Sender-Description  (optional) A sender-local prose
      description of the contents of the package, to assist in later use
      if returned to the sender.

Boyko, et al.           Expires December 1, 2008               [Page 11]

Internet-Draft                    BagIt                         May 2008

   Arbitrary other package metadata elements may follow these elements.
   Such elements could be used to describe the payload in ways intended
   for the sender in case of bag return.

Boyko, et al.           Expires December 1, 2008               [Page 12]

Internet-Draft                    BagIt                         May 2008

8.  Bag serialization

   In some scenarios, such as network transfer, it may be convenient for
   the sender first to serialize the filesystem hierarchy representing
   the bag (the outermost base directory) into a single-file archive
   format such as TAR or ZIP.  After receiving the resulting aggregate
   file, which we will call a _serialization_, the receiver deserializes
   it to recreate the filesystem hierarchy.  Several rules govern the
   serialization of a BagIt bag and apply equally to TAR or ZIP archive

   1.  One and only one bag is contained in one serialization.

   2.  The serialization has the same name as the bag's base directory,
       but with an extension added to identify the format; for example,
       the receiver of "mybag.tar.gz" expects the corresponding base
       directory to be created as "mybag".

   3.  A bag is never serialized from within its base directory, but
       from the parent of the base directory (where the base directory
       appears as an entry).  Thus, after a bag is deserialized in an
       empty directory, a listing of that directory shows exactly one
       entry.  For example, deserializing "mybag.zip" in an empty
       directory causes the creation of the base directory "mybag" with
       the payload and all the tag files deserialized beneath it.

   4.  One un-archiving (deserialization) step produces a single base
       directory bag with the top-level structure as described in this
       document without requiring an additional un-archiving step.  For
       example, after one un-archiving step it would be an error for the
       "data/" directory to appear as "data.tar.gz".  TAR and ZIP files
       may appear inside the payload beneath the "data/" directory,
       where they would be treated as opaquely as any other payload file
       or directory.

   When packaging a bag in an archive file format, care must be taken to
   ensure that the format's restrictions on file naming, such as
   allowable characters, length, or character encoding, will support the
   receiver requirements of the bag being packaged.

Boyko, et al.           Expires December 1, 2008               [Page 13]

Internet-Draft                    BagIt                         May 2008

9.  Disk and network transfer

   When recording a bag on physical media (such as hard disk, CD-ROM, or
   DVD), the sender will need to select and format the media in a manner
   compatible with the bag's content requirements for such things as
   file names and sizes, and with the receiver's technical
   infrastructure.  If the receiver's infrastructure is not known or the
   media needs to be compatible with a range of potential receivers,
   consideration should be given to portability and common usage.  For
   example, a "lowest common denominator" for some content and potential
   receivers could be USB disk drives formatted with the FAT32

   Transmitting a whole bag in serialized form as a single file will
   tend to be the most straightforward mode of transfer.  When
   throughput is a priority, use of "fetch.txt" lends itself to an easy,
   application-level parallelism in which the list of URL-addressed
   items to fetch is divided among multiple processes.  The mechanics of
   sending and receiving bags over networks is otherwise out of scope of
   the present document and may be facilitated by protocols such as

Boyko, et al.           Expires December 1, 2008               [Page 14]

Internet-Draft                    BagIt                         May 2008

10.  Another example bag

   Here's a bag of material resulting from a hypothetical web harvest.
   As before, lines of file content are shown in parentheses beneath the
   file name, with long lines continued indented on subsequent lines.
   This bag is not completely retrieved, of course, until every
   component listed in the "fetch.txt" file is retrieved.

Boyko, et al.           Expires December 1, 2008               [Page 15]

Internet-Draft                    BagIt                         May 2008

|   manifest-md5.txt
|    (93c53193ef96732c76e00b3fdd8f9dd3 data/Collection Overview.txt        )
|    (e9c5753d65b1ef5aeb281c0bb880c6c8 data/Seed List.txt                  )
|    (61c96810788283dc7be157b340e4eff4 data/gov-20060601-oth-050019.arc.gz )
|    (55c7c80c6635d5a4c8fe76a940bf353e data/gov-20060601-img-100002.arc.gz )
|   fetch.txt
|    (http://WB20.Stanford.Edu/gov-06-2006-ARC/gov-20060601-oth-050019.arc.gz
|        26583985 data/gov-20060601-oth-050019.arc.gz                      )
|    (http://WB20.Stanford.Edu/gov-06-2006-ARC/gov-20060601-img-100002.arc.gz
|        99509720 data/gov-20060601-img-100002.arc.gz                      )
|    ( ..................................................................... )
|   package-info.txt
|    (Source-organization: California Digital Library                      )
|    (Organization-address: 415 20th Street, 4th Floor, Oakland, CA. 94612 )
|    (Contact-name: A. E. Newman                                           )
|    (Contact-phone: +1 510-555-1234                                       )
|    (Contact-email: alfred@ucop.edu                                       )
|    (External-Description: The collection "Local Davis Flood Control      )
|      Collection" includes captured California State and local websites   )
|      containing information on flood control resources for the Davis and )
|      Sacramento area.  Sites were captured by UC Davis curator Wrigley   )
|      Spyder using the Web Archiving Service in February 2007 and         )
|      October 2007.                                                       )
|    (Delivery-date: 2008.04.15                                            )
|    (External-identifier: ark:/13030/fk4jm2bcp                            )
|    (Package-size: about 22Gb                                             )
|    (Internal-sender-identifier: UCDL                                     )
|    (Internal-sender-description: University of California Davis Libraries)
|   bagit.txt
|    (BagIt-version: 0.9                                                   )
|    (Tag-File-Character-Encoding: UTF-8                                   )
\--- data/
     |   Collection Overview.txt
     |    (... narrative description ...                                   )
     |   Seed List.txt
     |    (... list of crawler starting point URLs ...                     )

Boyko, et al.           Expires December 1, 2008               [Page 16]

Internet-Draft                    BagIt                         May 2008

11.  Interoperability: Windows and Unix file naming

   Whether transmitted over a network or on physical media, the naming
   of the files in the bag may be affected by differences in platforms
   between the sending and receiving sides of the transfer.  Besides the
   fundamental difference between path separators ('\' and '/'),
   generally, Windows filesystems have more limitations than Unix
   filesystems.  Windows path names have a maximum of 255 characters,
   and none of these characters may be used in a path component:

       < > : " / | ? *

   Windows also reserves the following names: CON, PRN, AUX, NUL, COM1,
   COM2, COM3, COM4, COM5, COM6, COM7, COM8, COM9, LPT1, LPT2, LPT3,
   LPT4, LPT5, LPT6, LPT7, LPT8, and LPT9.  See [MSFNAM] for more

Boyko, et al.           Expires December 1, 2008               [Page 17]

Internet-Draft                    BagIt                         May 2008

12.  Security considerations

   The BagIt package format poses no direct risk to computers and
   networks.  Implementors of tools that complete bags by retrieving
   URLs listed in a "fetch.txt" file need to be aware that some of those
   URLs may point to hosts, intentionally or unintentionally, that are
   not under control of the bag's sender.  Checksum algorithms are
   designed to protect against corruption and spoofing in bag transfer,
   but they are not a guarantee.

Boyko, et al.           Expires December 1, 2008               [Page 18]

Internet-Draft                    BagIt                         May 2008

13.  References

   [ENCDEP]   Tabata, K., "A Collaboration Model between Archival
              Systems to Enhance the Reliability of Preservation by an
              Enclose-and-Deposit Method", 2005,

   [GRABIT]   NDIIPP/CDL, "The GrabIt Package Exchange Protocol", 2008,

   [MSFNAM]   Microsoft, "Naming a File", 2008,

   [RFC1321]  Rivest, R., "The MD5 Message-Digest Algorithm", RFC 1321,
              April 1992.

   [RFC2822]  Resnick, P., "Internet Message Format", RFC 2822,
              April 2001.

   [RFC3174]  Eastlake, D. and P. Jones, "US Secure Hash Algorithm 1
              (SHA1)", RFC 3174, September 2001.

   [RFC3629]  Yergeau, F., "UTF-8, a transformation format of ISO
              10646", STD 63, RFC 3629, November 2003.

   [RFC3986]  Berners-Lee, T., Fielding, R., and L. Masinter, "Uniform
              Resource Identifier (URI): Generic Syntax", STD 66,
              RFC 3986, January 2005.

Boyko, et al.           Expires December 1, 2008               [Page 19]

Internet-Draft                    BagIt                         May 2008

Appendix A.  Change history

   (This appendix to be removed in the final draft.)

A.1.  Changes from draft-00, 2008.03.24

   Author address corrections and miscellaneous stylistic edits.

   Added some mention of physical media-based transfers, preferred
   characteristics of transfer filesystems, and network transfer issues.

   Added basic bag example early and changed the narrative to more
   clearly delineate component files.

   Wording changes under fetch.txt, and note that fetch.txt will need to
   be modified before bag return.

   Fixed checksum encoding reference to base64 rather than hex.  (B.

   Described simple normalization approach for checksum algorithm names.
   (B. Vargas)

   In the example bag, add the ARC files found in the fetch.txt to the
   manifest as well (A. Turoff)

Boyko, et al.           Expires December 1, 2008               [Page 20]

Internet-Draft                    BagIt                         May 2008

Authors' Addresses

   Andy Boyko
   Library of Congress
   101 Independence Avenue SE
   Washington, DC  20540

   Email: andy@boyko.net

   John A. Kunze
   California Digital Library
   415 20th St, 4th Floor
   Oakland, CA  94612

   Fax:   +1 510-893-5212
   Email: jak@ucop.edu

   Liz Madden
   Library of Congress
   101 Independence Avenue SE
   Washington, DC  20540

   Fax:   +1 202-707-1957
   Email: emad@loc.gov

   Justin Littman
   Library of Congress
   101 Independence Avenue SE
   Washington, DC  20540

   Fax:   +1 202-707-1957
   Email: jlit@loc.gov

Boyko, et al.           Expires December 1, 2008               [Page 21]

Internet-Draft                    BagIt                         May 2008

Full Copyright Statement

   Copyright (C) The IETF Trust (2008).

   This document is subject to the rights, licenses and restrictions
   contained in BCP 78, and except as set forth therein, the authors
   retain all their rights.

   This document and the information contained herein are provided on an

Intellectual Property

   The IETF takes no position regarding the validity or scope of any
   Intellectual Property Rights or other rights that might be claimed to
   pertain to the implementation or use of the technology described in
   this document or the extent to which any license under such rights
   might or might not be available; nor does it represent that it has
   made any independent effort to identify any such rights.  Information
   on the procedures with respect to rights in RFC documents can be
   found in BCP 78 and BCP 79.

   Copies of IPR disclosures made to the IETF Secretariat and any
   assurances of licenses to be made available, or the result of an
   attempt made to obtain a general license or permission for the use of
   such proprietary rights by implementers or users of this
   specification can be obtained from the IETF on-line IPR repository at

   The IETF invites any interested party to bring to its attention any
   copyrights, patents or patent applications, or other proprietary
   rights that may cover technology that may be required to implement
   this standard.  Please address the information to the IETF at


   Funding for the RFC Editor function is provided by the IETF
   Administrative Support Activity (IASA).

Boyko, et al.           Expires December 1, 2008               [Page 22]

Html markup produced by rfcmarkup 1.122, available from https://tools.ietf.org/tools/rfcmarkup/