[Docs] [txt|pdf] [Tracker] [Email] [Diff1] [Diff2] [Nits]
Versions: 00 01 02 03 04 05 06 07 08 09 10 11
12 13 14 15 16 17 RFC 8493
Network Working Group A. Boyko
Internet-Draft Library of Congress
Expires: January 12, 2009 J. Kunze
California Digital Library
J. Littman
L. Madden
Library of Congress
July 11, 2008
The BagIt File Package Format (V0.95)
http://www.ietf.org/internet-drafts/draft-kunze-bagit-02.txt
Status of this Memo
By submitting this Internet-Draft, each author represents that any
applicable patent or other IPR claims of which he or she is aware
have been or will be disclosed, and any of which he or she becomes
aware will be disclosed, in accordance with Section 6 of BCP 79.
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that
other groups may also distribute working documents as Internet-
Drafts.
Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."
The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt.
The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html.
This Internet-Draft will expire on January 12, 2009.
Copyright Notice
Copyright (C) The IETF Trust (2008).
Boyko, et al. Expires January 12, 2009 [Page 1]
Internet-Draft BagIt July 2008
Abstract
This document specifies BagIt, a hierarchical file package format for
the exchange of generalized digital content. A "bag" has just enough
structure to safely enclose a brief "tag" and a payload but does not
require any knowledge of the payload's internal semantics. This
BagIt format should be suitable for disk-based or network-based file
package transfer. One important use case is the possibility of
eventual safe return of a received bag. Tag information consists of
a small number of top-level reserved file names, checksums for
transfer validation, and optional small metadata blocks.
Boyko, et al. Expires January 12, 2009 [Page 2]
Internet-Draft BagIt July 2008
1. Introduction
BagIt is a hierarchical file package format designed to support disk-
based or network-based transfer of generalized digital content. A
"bag" holds a brief "tag" and an otherwise semantically opaque
payload. The name, BagIt, is inspired by the "enclose and deposit"
method [ENCDEP], sometimes referred to as "bag it and tag it".
In this document the word "directory" is used interchangeably with
the word "folder" and all examples conform to Unix-based filesystem
conventions which should translate easily to Windows conventions
after substituting the path separator ('\' instead of '/'). The
BagIt format itself places no limitations on file and path lengths,
so implementors thinking about maximal interoperation may wish to
consider the issues listed in the Interoperability section of this
document.
Boyko, et al. Expires January 12, 2009 [Page 3]
Internet-Draft BagIt July 2008
2. BagIt package layout
A "bag" consists of a base directory containing a set of top-level
files comprising the "tag" and a sub-directory named "data/" that
holds the payload. The base directory may have any name and the
"data/" directory may contain an arbitrary file hierarchy.
<bag_dir>/
| manifest-<algorithm>.txt
| bagit.txt
| [optional additional tag files]
\--- data/
| [optional file hierarchy]
The "tag" consists of one or more files named "manifest-
_algorithm_.txt", a file named "bagit.txt", and zero or more
additional files. In top-level text files with ".txt" extension,
each line should be terminated by a newline (LF) or carriage return
plus newline (CRLF); in practice cautious programmers will also
accept a carriage return by itself (CR) as a line terminator. In all
such tag files, text is assumed to be Unicode encoded as UTF-8
[RFC3629].
2.1. Declaration that this is a bag: bagit.txt
The "bagit.txt" file should consist of exactly two lines,
BagIt-Version: M.N
Tag-File-Character-Encoding: UTF-8
where M.N identifies the BagIt major (M) and minor (N) version
numbers, and UTF-8 identifies the character set encoding of tag
files.
2.2. File manifest: manifest-<algorithm>.txt
A manifest is a top-level file listing payload files that must be
present in a complete bag. Every bag must contain one or more
manifest files. A manifest file has a name of the form manifest-
_algorithm_.txt, where _algorithm_ is a string specifying a
cryptographic checksum algorithm, such as
manifest-md5.txt
manifest-sha1.txt
Implementors of tools that create and validate bags are strongly
encouraged to support at least two widely implemented checksum
algorithms: "md5" [RFC1321] and "sha1" [RFC3174]. When using other
Boyko, et al. Expires January 12, 2009 [Page 4]
Internet-Draft BagIt July 2008
algorithms, the name of the algorithm should be normalized for use in
the manifest's filename, by lowercasing the common name of the
algorithm, and removing all non-alphanumeric characters.
A manifest contains a complete list of files that must be present in
a fully constituted bag. Each line of a file manifest-
_algorithm_.txt has the form
CHECKSUM FILENAME
where FILENAME is the pathname of a file relative to the base
directory and CHECKSUM is a hex-encoded checksum calculated according
to _algorithm_ over the file's contents. As described below, tag
(top-level) files should be listed, if listed at all, in a separate
tag manifest file. One or more linear whitespace characters (spaces
or tabs) separate the CHECKSUM and FILENAME.
Boyko, et al. Expires January 12, 2009 [Page 5]
Internet-Draft BagIt July 2008
3. Example of a basic bag
Here's a very simple bag containing an image and a companion OCR
file. Lines of file content are shown in parentheses beneath the
file name.
myfirstbag/
|
| manifest-md5.txt
| (49afbd86a1ca9f34b677a3f09655eae9 data/27613-h/images/q172.png)
| (408ad21d50cef31da4df6d9ed81b01a7 data/27613-h/images/q172.txt)
|
| bagit.txt
| (BagIt-version: 0.9 )
| (Tag-File-Character-Encoding: UTF-8 )
|
\--- data/
|
| 27613-h/images/q172.png
| (... image bytes ... )
|
| 27613-h/images/q172.txt
| (... OCR text ... )
....
Boyko, et al. Expires January 12, 2009 [Page 6]
Internet-Draft BagIt July 2008
4. Completing a bag: fetch.txt
For reasons of efficiency, a bag may be sent with a list of files to
be fetched and added to the payload before it can meaningfully be
checked for completeness. An optional top-level file named
"fetch.txt", if present, contains such a list. Each line of
"fetch.txt" has the form
URL LENGTH FILENAME
where URL identifies the file to be fetched, LENGTH is the number of
octets in the file (or "-", to leave it unspecified), and FILENAME
identifies the corresponding payload file. One or more linear
whitespace characters (spaces or tabs) separate these three values,
and any such characters in the URL must be percent-encoded [RFC3986].
Because "fetch.txt" lists files that are absent from a sent bag,
receivers that are storing completed bags will want some way to
record that the bag no longer needs completing, such as renaming this
file (e.g., to "fetch-orig.txt") or changing a database flag; if bag
return is supported, the "fetch.txt" file will need to be modified as
appropriate to support the return transmission method. Receipt of a
bag is not final until all absent files are fetched. The receiver of
a bag with a "fetch.txt" tag file is expected promptly to complete
the bag by fetching all URL-identified components as the sender is
not bound to make the absent components available indefinitely.
The "fetch.txt" file essentially allows a bag to be transmitted with
"holes" in it, which can be practical for several reasons. For
example, it obviates the need for the sender to stage a large
serialized copy of the content until the bag is transferred to the
receiver. Also, this method allows a sender to construct a bag from
components that are either a subset of logically related components
(e.g., the localized logical object could be much larger than what is
intended for export) or assembled from logically distributed sources
(e.g., the object components for export are not stored locally under
one filesystem tree).
Boyko, et al. Expires January 12, 2009 [Page 7]
Internet-Draft BagIt July 2008
5. Tag file manifest: tagmanifest-<algorithm>.txt
Zero or more tag manifest files may be present. A tag manifest is a
top-level file with a name of the form tagmanifest-_algorithm_.txt,
where _algorithm_ is a string specifying a cryptographic checksum
algorithm. For example, a tag manifest file using SHA1 would have
the name
tagmanifest-sha1.txt
A tag manifest contains a list of tag files that must be present in a
fully constituted bag. It has the same form as the file manifest
described earlier, but must not list any payload files. As a result,
no FILENAME listed in a tag manifest begins "data/...".
Boyko, et al. Expires January 12, 2009 [Page 8]
Internet-Draft BagIt July 2008
6. Valid bags and complete bags
A bag is considered _valid_ if it is _complete_ and if each CHECKSUM
in every payload manifest and tag manifest can be verified against
the contents of its corresponding FILENAME.
A bag is considered _complete_ if every file in every manifest is
present, and if every payload file appears in at least one file
manifest. A payload file does not need to appear in every file
manifest as long as it appears in one file manifest (i.e., it must
belong to the "union" of file manifests). In a complete bag
containing one or more tag manifests, any tag file may appear in zero
or more of those manifests, but every tag file appearing in any tag
manifest must be present in the bag.
Because manifests do not list directories specifically, only
referencing them indirectly in file pathnames, they cannot account
for receipt of an empty directory. Nor can its creation be implied
using a "fetch.txt" file. To guarantee receipt of a directory, the
sender may wish to include at least one file; it suffices, for
example, to include a zero-length file named ".keep".
Boyko, et al. Expires January 12, 2009 [Page 9]
Internet-Draft BagIt July 2008
7. Other BagIt metadata: package-info.txt
Any other tag files are considered to be package information separate
from the payload content. The "data/" directory is the custodial
focus of a bag, and the top-level files comprising the tag are
intended to facilitate and document the transfer. The tag could also
be used to help in returning the bag to its sender at some point in
the future.
Tag information is optional. If present, tag information at a
minimum consists of a package-info.txt file. This is a text file
intended primarily for human readability using email-style headers
[RFC2822]. It is recommended that lines not exceed 79 characters in
length. As mentioned earlier, text is assumed to be Unicode encoded
as UTF-8.
The package-info.txt file contains metadata elements describing the
overall package. It looks like this.
Source-Organization: Spengler University
Organization-Address: 1400 Elm St., Cupertino, California, 95014
Contact-Name: Edna Janssen
Contact-Phone: +1 408-555-1212
Contact-Email: ej@spengler.edu
External-Description: Uncompressed greyscale TIFF images from the
Yoshimuri papers colle...
Packing-Date: 2008-01-15
External-Identifier: spengler_yoshimuri_001
Package-Size: 260 GB
Bag-Group-Identifier: spengler_yoshimuri
Bag-Count: 1 of 15
Internal-Sender-Identifier: /storage/images/yoshimuri
Internal-Sender-Description: Uncompressed greyscale TIFFs created from
microfilm and are...
All elements are provided as clues to ease handling on the sender and
receiver ends. No particular relationship between the sender
organization and the payload content is assumed; for example, the
sender may be a content aggregator, redistributor, collector,
curator, or producer.
Reserved element names are case-insensitive and defined as follows.
Source-Organization Organization transferring the content.
Boyko, et al. Expires January 12, 2009 [Page 10]
Internet-Draft BagIt July 2008
Organization-Address Mailing address of the organization.
Contact-Name Person at the source organization who is responsible
for the content transfer.
Contact-Phone International format telephone number of person or
position responsible.
Contact-Email Fully qualified email address of person or position
responsible.
External-Description A brief explanation of the contents and
provenance.
Packing-Date Date (YYYY-MM-DD) that the content was prepared for
delivery.
External-Identifier A sender-supplied identifier for the package.
This identifier must be unique across the sender's content, and if
recognizable as belonging to a globally unique scheme, the
receiver should make an effort to honor reference to it.
Package-Size Size or approximate size of the package being
transferred, followed by an abbreviation such as MB (megabytes),
GB, or TB; for example, 42600 MB, 42.6 GB, or .043 TB.
Bag-Group-Identifier (optional) A sender-supplied identifier for the
set, if any, of bags to which it logically belongs. This
identifier must be unique across the sender's content, and if
recognizable as belonging to a globally unique scheme, the
receiver should make an effort to honor reference to it.
Bag-Count (optional) Two numbers separated by "of", in particular,
"N of T", where T is the total number of bags in a group of bags
and N is the ordinal number within the group; if T is not known,
specify it as "?" (question mark). Examples: 1 of 2, 4 of 4, 3 of
?, 89 of 145.
Internal-Sender-Identifier (optional) An alternate sender-specific
identifier for the content and/or package. This value may be
useful to senders who may retrieve the content in the future. For
instance, it might contain values that are relevant to the re-use
of the content at the sender's organization.
Internal-Sender-Description (optional) A sender-local prose
description of the contents of the package, to assist in later use
if returned to the sender.
Boyko, et al. Expires January 12, 2009 [Page 11]
Internet-Draft BagIt July 2008
Arbitrary other package metadata elements may follow these elements.
Such elements could be used to describe the payload in ways intended
for the sender in case of bag return.
Boyko, et al. Expires January 12, 2009 [Page 12]
Internet-Draft BagIt July 2008
8. Bag serialization
In some scenarios, such as network transfer, it may be convenient for
the sender first to serialize the filesystem hierarchy representing
the bag (the outermost base directory) into a single-file archive
format such as TAR or ZIP. After receiving the resulting aggregate
file, which we will call a _serialization_, the receiver deserializes
it to recreate the filesystem hierarchy. Several rules govern the
serialization of a BagIt bag and apply equally to TAR or ZIP archive
files:
1. One and only one bag is contained in one serialization.
2. The serialization has the same name as the bag's base directory,
but with an extension added to identify the format; for example,
the receiver of "mybag.tar.gz" expects the corresponding base
directory to be created as "mybag".
3. A bag is never serialized from within its base directory, but
from the parent of the base directory (where the base directory
appears as an entry). Thus, after a bag is deserialized in an
empty directory, a listing of that directory shows exactly one
entry. For example, deserializing "mybag.zip" in an empty
directory causes the creation of the base directory "mybag" and,
beneath "mybag", the creation of all payload and tag files.
4. One un-archiving (deserialization) step produces a single base
directory bag with the top-level structure as described in this
document without requiring an additional un-archiving step. For
example, after one un-archiving step it would be an error for the
"data/" directory to appear as "data.tar.gz". TAR and ZIP files
may appear inside the payload beneath the "data/" directory,
where they would be treated as opaquely as any other payload file
or directory.
When preparing a bag in an archive file format, care must be taken to
ensure that the format's restrictions on file naming, such as
allowable characters, length, or character encoding, will support the
receiver's requirements.
Boyko, et al. Expires January 12, 2009 [Page 13]
Internet-Draft BagIt July 2008
9. Disk and network transfer
When recording a bag on physical media (such as hard disk, CD-ROM, or
DVD), the sender will need to select and format the media in a manner
compatible with the bag's content requirements for such things as
file names and sizes, and with the receiver's technical
infrastructure. If the receiver's infrastructure is not known or the
media needs to be compatible with a range of potential receivers,
consideration should be given to portability and common usage. For
example, a "lowest common denominator" for some content and potential
receivers could be USB disk drives formatted with the FAT32
filesystem.
Transmitting a whole bag in serialized form as a single file will
tend to be the most straightforward mode of transfer. When
throughput is a priority, use of "fetch.txt" lends itself to an easy,
application-level parallelism in which the list of URL-addressed
items to fetch is divided among multiple processes. The mechanics of
sending and receiving bags over networks is otherwise out of scope of
the present document and may be facilitated by protocols such as
[GRABIT].
Boyko, et al. Expires January 12, 2009 [Page 14]
Internet-Draft BagIt July 2008
10. Another example bag
Here's a bag of material resulting from a hypothetical web harvest.
As before, lines of file content are shown in parentheses beneath the
file name, with long lines continued indented on subsequent lines.
This bag is not completely retrieved, of course, until every
component listed in the "fetch.txt" file is retrieved.
Boyko, et al. Expires January 12, 2009 [Page 15]
Internet-Draft BagIt July 2008
mysecondbag/
|
| manifest-md5.txt
| (93c53193ef96732c76e00b3fdd8f9dd3 data/Collection Overview.txt )
| (e9c5753d65b1ef5aeb281c0bb880c6c8 data/Seed List.txt )
| (61c96810788283dc7be157b340e4eff4 data/gov-20060601-oth-050019.arc.gz )
| (55c7c80c6635d5a4c8fe76a940bf353e data/gov-20060601-img-100002.arc.gz )
|
| fetch.txt
| (http://WB20.Stanford.Edu/gov-06-2006-ARC/gov-20060601-oth-050019.arc.gz
| 26583985 data/gov-20060601-oth-050019.arc.gz )
| (http://WB20.Stanford.Edu/gov-06-2006-ARC/gov-20060601-img-100002.arc.gz
| 99509720 data/gov-20060601-img-100002.arc.gz )
| ( ................................................................... )
|
| package-info.txt
| (Source-organization: California Digital Library )
| (Organization-address: 415 20th Street, 4th Floor, Oakland, CA. 94612 )
| (Contact-name: A. E. Newman )
| (Contact-phone: +1 510-555-1234 )
| (Contact-email: alfred@ucop.edu )
| (External-Description: The collection "Local Davis Flood Control )
| Collection" includes captured California State and local websites )
| containing information on flood control resources for the Davis and )
| Sacramento area. Sites were captured by UC Davis curator Wrigley )
| Spyder using the Web Archiving Service in February 2007 and )
| October 2007. )
| (Packing-date: 2008.04.15 )
| (External-identifier: ark:/13030/fk4jm2bcp )
| (Package-size: about 22Gb )
| (Internal-sender-identifier: UCDL )
| (Internal-sender-description: University of California Davis Libraries)
|
| bagit.txt
| (BagIt-version: 0.9 )
| (Tag-File-Character-Encoding: UTF-8 )
|
\--- data/
|
| Collection Overview.txt
| (... narrative description ... )
|
| Seed List.txt
| (... list of crawler starting point URLs ... )
....
Boyko, et al. Expires January 12, 2009 [Page 16]
Internet-Draft BagIt July 2008
11. Interoperability: Windows and Unix file naming
Whether transmitted over a network or on physical media, the naming
of the files in the bag may be affected by differences in platforms
between the sending and receiving sides of the transfer. Besides the
fundamental difference between path separators ('\' and '/'),
generally, Windows filesystems have more limitations than Unix
filesystems. Windows path names have a maximum of 255 characters,
and none of these characters may be used in a path component:
< > : " / | ? *
Windows also reserves the following names: CON, PRN, AUX, NUL, COM1,
COM2, COM3, COM4, COM5, COM6, COM7, COM8, COM9, LPT1, LPT2, LPT3,
LPT4, LPT5, LPT6, LPT7, LPT8, and LPT9. See [MSFNAM] for more
information.
Boyko, et al. Expires January 12, 2009 [Page 17]
Internet-Draft BagIt July 2008
12. Security considerations
The BagIt package format poses no direct risk to computers and
networks. Implementors of tools that complete bags by retrieving
URLs listed in a "fetch.txt" file need to be aware that some of those
URLs may point to hosts, intentionally or unintentionally, that are
not under control of the bag's sender. Checksum algorithms are
designed to protect against corruption and spoofing in bag transfer,
but they are not a guarantee.
Boyko, et al. Expires January 12, 2009 [Page 18]
Internet-Draft BagIt July 2008
13. References
[ENCDEP] Tabata, K., "A Collaboration Model between Archival
Systems to Enhance the Reliability of Preservation by an
Enclose-and-Deposit Method", 2005,
<http://www.iwaw.net/05/papers/iwaw05-tabata.pdf>.
[GRABIT] NDIIPP/CDL, "The GrabIt Package Exchange Protocol", 2008,
<http://dot.ucop.edu/home/jak/grabitspec.html>.
[MSFNAM] Microsoft, "Naming a File", 2008,
<http://msdn2.microsoft.com/en-us/library/aa365247.aspx>.
[RFC1321] Rivest, R., "The MD5 Message-Digest Algorithm", RFC 1321,
April 1992.
[RFC2822] Resnick, P., "Internet Message Format", RFC 2822,
April 2001.
[RFC3174] Eastlake, D. and P. Jones, "US Secure Hash Algorithm 1
(SHA1)", RFC 3174, September 2001.
[RFC3629] Yergeau, F., "UTF-8, a transformation format of ISO
10646", STD 63, RFC 3629, November 2003.
[RFC3986] Berners-Lee, T., Fielding, R., and L. Masinter, "Uniform
Resource Identifier (URI): Generic Syntax", STD 66,
RFC 3986, January 2005.
Boyko, et al. Expires January 12, 2009 [Page 19]
Internet-Draft BagIt July 2008
Appendix A. Change history
(This appendix to be removed in the final draft.)
A.1. Changes from draft-01, 2008.05.30
Added mention of preserving empty directories.
Simplified function of "tag checksum file" to "tag manifest", having
same format as payload manifest. The tag manifest is optional and
need not include every tag file.
Loosened interpretation of payload manifest to "union" concept: every
payload file must be listed in at least one manifest but need not be
listed in every manifest.
Shortened the Introduction's first paragraph to be less duplicative
of text in the Abstract.
Changed Delivery-Date to Packing-Date.
Correctly sorted the author list and clarification of deserialization
wording.
A.2. Changes from draft-00, 2008.03.24
Author address corrections and miscellaneous stylistic edits.
Added some mention of physical media-based transfers, preferred
characteristics of transfer filesystems, and network transfer issues.
Added basic bag example early and changed the narrative to more
clearly delineate component files.
Wording changes under fetch.txt, and note that fetch.txt will need to
be modified before bag return.
Fixed checksum encoding reference to base64 rather than hex. (B.
Vargas)
Described simple normalization approach for checksum algorithm names.
(B. Vargas)
In the example bag, add the ARC files found in the fetch.txt to the
manifest as well (A. Turoff)
Boyko, et al. Expires January 12, 2009 [Page 20]
Internet-Draft BagIt July 2008
Authors' Addresses
Andy Boyko
Library of Congress
101 Independence Avenue SE
Washington, DC 20540
USA
Email: andy@boyko.net
John A. Kunze
California Digital Library
415 20th St, 4th Floor
Oakland, CA 94612
US
Fax: +1 510-893-5212
Email: jak@ucop.edu
Justin Littman
Library of Congress
101 Independence Avenue SE
Washington, DC 20540
USA
Fax: +1 202-707-1957
Email: jlit@loc.gov
Liz Madden
Library of Congress
101 Independence Avenue SE
Washington, DC 20540
USA
Fax: +1 202-707-1957
Email: emad@loc.gov
Boyko, et al. Expires January 12, 2009 [Page 21]
Internet-Draft BagIt July 2008
Full Copyright Statement
Copyright (C) The IETF Trust (2008).
This document is subject to the rights, licenses and restrictions
contained in BCP 78, and except as set forth therein, the authors
retain all their rights.
This document and the information contained herein are provided on an
"AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY, THE IETF TRUST AND
THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS
OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF
THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
Intellectual Property
The IETF takes no position regarding the validity or scope of any
Intellectual Property Rights or other rights that might be claimed to
pertain to the implementation or use of the technology described in
this document or the extent to which any license under such rights
might or might not be available; nor does it represent that it has
made any independent effort to identify any such rights. Information
on the procedures with respect to rights in RFC documents can be
found in BCP 78 and BCP 79.
Copies of IPR disclosures made to the IETF Secretariat and any
assurances of licenses to be made available, or the result of an
attempt made to obtain a general license or permission for the use of
such proprietary rights by implementers or users of this
specification can be obtained from the IETF on-line IPR repository at
http://www.ietf.org/ipr.
The IETF invites any interested party to bring to its attention any
copyrights, patents or patent applications, or other proprietary
rights that may cover technology that may be required to implement
this standard. Please address the information to the IETF at
ietf-ipr@ietf.org.
Acknowledgment
Funding for the RFC Editor function is provided by the IETF
Administrative Support Activity (IASA).
Boyko, et al. Expires January 12, 2009 [Page 22]
Html markup produced by rfcmarkup 1.129d, available from
https://tools.ietf.org/tools/rfcmarkup/