draft-ietf-iri-3987bis-08.txt   draft-ietf-iri-3987bis-09.txt 
Internationalized Resource M. Duerst Internationalized Resource Identifiers M. Duerst
Identifiers (iri) Aoyama Gakuin University (iri) Aoyama Gakuin University
Internet-Draft M. Suignard Internet-Draft M. Suignard
Obsoletes: 3987 (if approved) Unicode Consortium Obsoletes: 3987 (if approved) Unicode Consortium
Intended status: Standards Track L. Masinter Intended status: Standards Track L. Masinter
Expires: April 24, 2012 Adobe Expires: July 12, 2012 Adobe
October 22, 2011 January 9, 2012
Internationalized Resource Identifiers (IRIs) Internationalized Resource Identifiers (IRIs)
draft-ietf-iri-3987bis-08 draft-ietf-iri-3987bis-09
Abstract Abstract
This document defines the Internationalized Resource Identifier (IRI) This document defines the Internationalized Resource Identifier (IRI)
protocol element, as an extension of the Uniform Resource Identifier protocol element, as an extension of the Uniform Resource Identifier
(URI). An IRI is a sequence of characters from the Universal (URI). An IRI is a sequence of characters from the Universal
Character Set (Unicode/ISO 10646). Grammar and processing rules are Character Set (Unicode/ISO 10646). Grammar and processing rules are
given for IRIs and related syntactic forms. given for IRIs and related syntactic forms.
In addition, this document provides named additional rule sets for
processing otherwise invalid IRIs, in a way that supports other
specifications that wish to mandate common behavior for 'error'
handling. In particular, rules used in some XML languages (LEIRI)
and web applications are given.
Defining IRI as new protocol element (rather than updating or Defining IRI as new protocol element (rather than updating or
extending the definition of URI) allows independent orderly extending the definition of URI) allows independent orderly
transitions: other protocols and languages that use URIs must transitions: other protocols and languages that use URIs must
explicitly choose to allow IRIs. explicitly choose to allow IRIs.
Guidelines are provided for the use and deployment of IRIs and Guidelines are provided for the use and deployment of IRIs and
related protocol elements when revising protocols, formats, and related protocol elements when revising protocols, formats, and
software components that currently deal only with URIs. software components that currently deal only with URIs.
This document is part of a set of documents intended to replace RFC
3987.
RFC Editor: Please remove the next paragraph before publication. RFC Editor: Please remove the next paragraph before publication.
This (and several companion documents) are intended to obsolete RFC This (and several companion documents) are intended to obsolete RFC
3987, and also move towards IETF Draft Standard. For discussion and 3987, and also move towards IETF Draft Standard. For discussion and
comments on these drafts, please join the IETF IRI WG by subscribing comments on these drafts, please join the IETF IRI WG by subscribing
to the mailing list public-iri@w3.org, archives at to the mailing list public-iri@w3.org, archives at
http://lists.w3.org/archives/public/public-iri/. For a list of open http://lists.w3.org/archives/public/public-iri/. For a list of open
issues, please see the issue tracker of the WG at issues, please see the issue tracker of the WG at
http://trac.tools.ietf.org/wg/iri/trac/report/1. For a list of http://trac.tools.ietf.org/wg/iri/trac/report/1. For a list of
individual edits, please see the change history at individual edits, please see the change history at
http://trac.tools.ietf.org/wg/iri/trac/log/draft-ietf-iri-3987bis. http://trac.tools.ietf.org/wg/iri/trac/log/draft-ietf-iri-3987bis.
Status of this Memo Status of this Memo
This Internet-Draft is submitted to IETF in full conformance with the
This Internet-Draft is submitted in full conformance with the
provisions of BCP 78 and BCP 79. provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that Task Force (IETF). Note that other groups may also distribute
other groups may also distribute working documents as Internet- working documents as Internet-Drafts. The list of current Internet-
Drafts. Drafts is at http://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress." material or to cite them other than as "work in progress."
The list of current Internet-Drafts can be accessed at This Internet-Draft will expire on July 12, 2012.
http://www.ietf.org/ietf/1id-abstracts.txt.
The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html.
This Internet-Draft will expire on April 24, 2012.
Copyright Notice Copyright Notice
Copyright (c) 2011 IETF Trust and the persons identified as the Copyright (c) 2012 IETF Trust and the persons identified as the
document authors. All rights reserved. document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents Provisions Relating to IETF Documents
(http://trustee.ietf.org/license-info) in effect on the date of (http://trustee.ietf.org/license-info) in effect on the date of
publication of this document. Please review these documents publication of this document. Please review these documents
carefully, as they describe your rights and restrictions with respect carefully, as they describe your rights and restrictions with respect
to this document. Code Components extracted from this document must to this document. Code Components extracted from this document must
include Simplified BSD License text as described in Section 4.e of include Simplified BSD License text as described in Section 4.e of
the Trust Legal Provisions and are provided without warranty as the Trust Legal Provisions and are provided without warranty as
described in the BSD License. described in the Simplified BSD License.
This document may contain material from IETF Documents or IETF This document may contain material from IETF Documents or IETF
Contributions published or made publicly available before November Contributions published or made publicly available before November
10, 2008. The person(s) controlling the copyright in some of this 10, 2008. The person(s) controlling the copyright in some of this
material may not have granted the IETF Trust the right to allow material may not have granted the IETF Trust the right to allow
modifications of such material outside the IETF Standards Process. modifications of such material outside the IETF Standards Process.
Without obtaining an adequate license from the person(s) controlling Without obtaining an adequate license from the person(s) controlling
the copyright in such materials, this document may not be modified the copyright in such materials, this document may not be modified
outside the IETF Standards Process, and derivative works of it may outside the IETF Standards Process, and derivative works of it may
not be created outside the IETF Standards Process, except to format not be created outside the IETF Standards Process, except to format
skipping to change at page 3, line 15 skipping to change at page 3, line 15
Table of Contents Table of Contents
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 5 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.1. Overview and Motivation . . . . . . . . . . . . . . . . . 5 1.1. Overview and Motivation . . . . . . . . . . . . . . . . . 5
1.2. Applicability . . . . . . . . . . . . . . . . . . . . . . 6 1.2. Applicability . . . . . . . . . . . . . . . . . . . . . . 6
1.3. Definitions . . . . . . . . . . . . . . . . . . . . . . . 7 1.3. Definitions . . . . . . . . . . . . . . . . . . . . . . . 7
1.4. Notation . . . . . . . . . . . . . . . . . . . . . . . . 8 1.4. Notation . . . . . . . . . . . . . . . . . . . . . . . . 8
2. IRI Syntax . . . . . . . . . . . . . . . . . . . . . . . . . . 9 2. IRI Syntax . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.1. Summary of IRI Syntax . . . . . . . . . . . . . . . . . . 9 2.1. Summary of IRI Syntax . . . . . . . . . . . . . . . . . . 9
2.2. ABNF for IRI References and IRIs . . . . . . . . . . . . 10 2.2. ABNF for IRI References and IRIs . . . . . . . . . . . . 10
3. Processing IRIs and related protocol elements . . . . . . . . 12 3. Processing IRIs and related protocol elements . . . . . . . . 13
3.1. Converting to UCS . . . . . . . . . . . . . . . . . . . . 13 3.1. Converting to UCS . . . . . . . . . . . . . . . . . . . . 13
3.2. Parse the IRI into IRI components . . . . . . . . . . . . 13 3.2. Parse the IRI into IRI components . . . . . . . . . . . . 13
3.3. General percent-encoding of IRI components . . . . . . . 14 3.3. General percent-encoding of IRI components . . . . . . . 14
3.4. Mapping ireg-name . . . . . . . . . . . . . . . . . . . . 14 3.4. Mapping ireg-name . . . . . . . . . . . . . . . . . . . . 14
3.4.1. Mapping using Percent-Encoding . . . . . . . . . . . . 14 3.4.1. Mapping using Percent-Encoding . . . . . . . . . . . . 14
3.4.2. Mapping using Punycode . . . . . . . . . . . . . . . . 14 3.4.2. Mapping using Punycode . . . . . . . . . . . . . . . . 14
3.4.3. Additional Considerations . . . . . . . . . . . . . . 15 3.4.3. Additional Considerations . . . . . . . . . . . . . . 15
3.5. Mapping query components . . . . . . . . . . . . . . . . 16 3.5. Mapping query components . . . . . . . . . . . . . . . . 16
3.6. Mapping IRIs to URIs . . . . . . . . . . . . . . . . . . 16 3.6. Mapping IRIs to URIs . . . . . . . . . . . . . . . . . . 16
3.7. Converting URIs to IRIs . . . . . . . . . . . . . . . . . 16 4. Converting URIs to IRIs . . . . . . . . . . . . . . . . . . . 16
3.7.1. Examples . . . . . . . . . . . . . . . . . . . . . . . 18 4.1. Examples . . . . . . . . . . . . . . . . . . . . . . . . 18
4. Use of IRIs . . . . . . . . . . . . . . . . . . . . . . . . . 19 5. Use of IRIs . . . . . . . . . . . . . . . . . . . . . . . . . 19
4.1. Limitations on UCS Characters Allowed in IRIs . . . . . . 19 5.1. Limitations on UCS Characters Allowed in IRIs . . . . . . 19
4.2. Software Interfaces and Protocols . . . . . . . . . . . . 20 5.2. Software Interfaces and Protocols . . . . . . . . . . . . 20
4.3. Format of URIs and IRIs in Documents and Protocols . . . 20 5.3. Format of URIs and IRIs in Documents and Protocols . . . 20
4.4. Use of UTF-8 for Encoding Original Characters . . . . . . 20 5.4. Use of UTF-8 for Encoding Original Characters . . . . . . 20
4.5. Relative IRI References . . . . . . . . . . . . . . . . . 22 5.5. Relative IRI References . . . . . . . . . . . . . . . . . 22
5. Liberal Handling of Otherwise Invalid IRIs . . . . . . . . . . 22 6. Legacy Extended IRIs (LEIRIs) . . . . . . . . . . . . . . . . 22
5.1. LEIRI Processing . . . . . . . . . . . . . . . . . . . . 22 6.1. Legacy Extended IRI Syntax . . . . . . . . . . . . . . . 23
6. Characters Disallowed or Not Recommended in IRIs . . . . . . . 23 6.2. Conversion of Legacy Extended IRIs to IRIs . . . . . . . 23
6.3. Characters Allowed in Legacy Extended IRIs but not in
IRIs . . . . . . . . . . . . . . . . . . . . . . . . . . 23
7. URI/IRI Processing Guidelines (Informative) . . . . . . . . . 25 7. URI/IRI Processing Guidelines (Informative) . . . . . . . . . 25
7.1. URI/IRI Software Interfaces . . . . . . . . . . . . . . . 25 7.1. URI/IRI Software Interfaces . . . . . . . . . . . . . . . 25
7.2. URI/IRI Entry . . . . . . . . . . . . . . . . . . . . . . 25 7.2. URI/IRI Entry . . . . . . . . . . . . . . . . . . . . . . 26
7.3. URI/IRI Transfer between Applications . . . . . . . . . . 26 7.3. URI/IRI Transfer between Applications . . . . . . . . . . 26
7.4. URI/IRI Generation . . . . . . . . . . . . . . . . . . . 27 7.4. URI/IRI Generation . . . . . . . . . . . . . . . . . . . 27
7.5. URI/IRI Selection . . . . . . . . . . . . . . . . . . . . 27 7.5. URI/IRI Selection . . . . . . . . . . . . . . . . . . . . 27
7.6. Display of URIs/IRIs . . . . . . . . . . . . . . . . . . 28 7.6. Display of URIs/IRIs . . . . . . . . . . . . . . . . . . 28
7.7. Interpretation of URIs and IRIs . . . . . . . . . . . . . 28 7.7. Interpretation of URIs and IRIs . . . . . . . . . . . . . 28
7.8. Upgrading Strategy . . . . . . . . . . . . . . . . . . . 29 7.8. Upgrading Strategy . . . . . . . . . . . . . . . . . . . 29
8. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 30 8. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 30
9. Security Considerations . . . . . . . . . . . . . . . . . . . 30 9. Security Considerations . . . . . . . . . . . . . . . . . . . 30
10. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 31 10. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 31
11. Main Changes Since RFC 3987 . . . . . . . . . . . . . . . . . 32 11. Main Changes Since RFC 3987 . . . . . . . . . . . . . . . . . 32
11.1. Split out Bidi, processing guidelines, comparison 11.1. Split out Bidi, processing guidelines, comparison
sections . . . . . . . . . . . . . . . . . . . . . . . . 32 sections . . . . . . . . . . . . . . . . . . . . . . . . 32
11.2. Major restructuring of IRI processing model . . . . . . . 32 11.2. Major restructuring of IRI processing model . . . . . . . 32
11.2.1. OLD WAY . . . . . . . . . . . . . . . . . . . . . . . 32 11.2.1. OLD WAY . . . . . . . . . . . . . . . . . . . . . . . 32
11.2.2. NEW WAY . . . . . . . . . . . . . . . . . . . . . . . 33 11.2.2. NEW WAY . . . . . . . . . . . . . . . . . . . . . . . 33
11.2.3. Extension of Syntax . . . . . . . . . . . . . . . . . 33 11.2.3. Extension of Syntax . . . . . . . . . . . . . . . . . 33
11.2.4. More to be added . . . . . . . . . . . . . . . . . . . 33 11.2.4. More to be added . . . . . . . . . . . . . . . . . . . 33
11.3. Change Log . . . . . . . . . . . . . . . . . . . . . . . 33 11.3. Change Log . . . . . . . . . . . . . . . . . . . . . . . 33
11.3.1. Changes after draft-ietf-iri-3987bis-01 . . . . . . . 33 11.3.1. Changes after draft-ietf-iri-3987bis-01 . . . . . . . 33
11.3.2. Changes from draft-duerst-iri-bis-07 to 11.3.2. Changes from draft-duerst-iri-bis-07 to
draft-ietf-iri-3987bis-00 . . . . . . . . . . . . . . 33 draft-ietf-iri-3987bis-00 . . . . . . . . . . . . . . 34
11.3.3. Changes from -06 to -07 of draft-duerst-iri-bis . . . 33 11.3.3. Changes from -06 to -07 of draft-duerst-iri-bis . . . 34
11.4. Changes from -00 to -01 . . . . . . . . . . . . . . . . . 34 11.4. Changes from -00 to -01 . . . . . . . . . . . . . . . . . 34
11.5. Changes from -05 to -06 of draft-duerst-iri-bis-00 . . . 34 11.5. Changes from -05 to -06 of draft-duerst-iri-bis-00 . . . 34
11.6. Changes from -04 to -05 of draft-duerst-iri-bis . . . . . 34 11.6. Changes from -04 to -05 of draft-duerst-iri-bis . . . . . 34
11.7. Changes from -03 to -04 of draft-duerst-iri-bis . . . . . 34 11.7. Changes from -03 to -04 of draft-duerst-iri-bis . . . . . 34
11.8. Changes from -02 to -03 of draft-duerst-iri-bis . . . . . 34 11.8. Changes from -02 to -03 of draft-duerst-iri-bis . . . . . 35
11.9. Changes from -01 to -02 of draft-duerst-iri-bis . . . . . 34 11.9. Changes from -01 to -02 of draft-duerst-iri-bis . . . . . 35
11.10. Changes from -00 to -01 of draft-duerst-iri-bis . . . . . 35 11.10. Changes from -00 to -01 of draft-duerst-iri-bis . . . . . 35
11.11. Changes from RFC 3987 to -00 of draft-duerst-iri-bis . . 35 11.11. Changes from RFC 3987 to -00 of draft-duerst-iri-bis . . 35
12. References . . . . . . . . . . . . . . . . . . . . . . . . . . 35 12. References . . . . . . . . . . . . . . . . . . . . . . . . . . 35
12.1. Normative References . . . . . . . . . . . . . . . . . . 35 12.1. Normative References . . . . . . . . . . . . . . . . . . 35
12.2. Informative References . . . . . . . . . . . . . . . . . 36 12.2. Informative References . . . . . . . . . . . . . . . . . 36
Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 39 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 39
1. Introduction 1. Introduction
1.1. Overview and Motivation 1.1. Overview and Motivation
skipping to change at page 5, line 31 skipping to change at page 5, line 31
transcribed with Latin letters. These transcriptions are now often transcribed with Latin letters. These transcriptions are now often
used in URIs, but they introduce additional difficulties. used in URIs, but they introduce additional difficulties.
The infrastructure for the appropriate handling of characters from The infrastructure for the appropriate handling of characters from
additional scripts is now widely deployed in operating system and additional scripts is now widely deployed in operating system and
application software. Software that can handle a wide variety of application software. Software that can handle a wide variety of
scripts and languages at the same time is increasingly common. Also, scripts and languages at the same time is increasingly common. Also,
an increasing number of protocols and formats can carry a wide range an increasing number of protocols and formats can carry a wide range
of characters. of characters.
URIs are used both as a protocol element (for transmission and URIs are composed out of a very limited repertoire of characters;
processing by software) and also a presentation element (for display this design choice was made to support global transcription ([RFC3986]
and handling by people who read, interpret, coin, or guess them). section 1.2.1.). Reliable transition between a URI (as an abstract
The transition between these roles is more difficult and complex when protocol element composed of a sequence of characters) and a
dealing with the larger set of characters than allowed for URIs in presentation of that URI (written on a napkin, read out loud) and
[RFC3986]. back is relatively straightforward, because of the limited repertoire
of characters used. IRIs are designed to satisfy a different set of
use requirements; in particular, to allow IRIs to be written in ways
that are more meaningful to their users, even at the expense of
global transcribability. However, ensuring reliability of the
transition between an IRI and its presentation and back is more
difficult and complex when dealing with the larger set of Unicode
characters. For example, Unicode supports multiple ways of encoding
complex combinations of characters and accents, with multiple
character sequences that can result in the same presentation.
This document defines the protocol element called Internationalized This document defines the protocol element called Internationalized
Resource Identifier (IRI), which allows applications of URIs to be Resource Identifier (IRI), which allows applications of URIs to be
extended to use resource identifiers that have a much wider extended to use resource identifiers that have a much wider
repertoire of characters. It also provides corresponding repertoire of characters. It also provides corresponding
"internationalized" versions of other constructs from [RFC3986], such "internationalized" versions of other constructs from [RFC3986], such
as URI references. The syntax of IRIs is defined in Section 2. as URI references. The syntax of IRIs is defined in Section 2.
Using characters outside of A - Z in IRIs adds a number of Within this document, Section 5 discusses the use of IRIs in
difficulties. Section 4 discusses the use of IRIs in different different situations. Section 7 gives additional informative
situations. Section 7 gives additional informative guidelines. guidelines. Section 9 discusses IRI-specific security
Section 9 discusses IRI-specific security considerations. considerations.
[Bidi] discusses the special case of bidirectional IRIs using
characters from scripts written right-to-left. [Equivalence] gives
guidelines for applications wishing to determine if two IRIs are
equivalent, as well as defining some equivalence methods.
[RFC4395bis] updates the URI scheme registration guidelines and
proceedures to note that every URI scheme is also automatically an
IRI scheme and to allow scheme definitions to be directly described
in terms of Unicode characters.
When originally defining IRIs, several design alternatives were This specification is part of a collection of specifications intended
considered. Historically interested readers can find an overview in to replace [RFC3987]. [Bidi] discusses the special case of
Appendix A of [RFC3987]. For some additional background on the bidirectional IRIs using characters from scripts written right-to-
design of URIs and IRIs, please also see [Gettys]. left. [Equivalence] gives guidelines for applications wishing to
determine if two IRIs are equivalent, as well as defining some
equivalence methods. [RFC4395bis] updates the URI scheme
registration guidelines and procedures to note that every URI scheme
is also automatically an IRI scheme and to allow scheme definitions
to be directly described in terms of Unicode characters.
1.2. Applicability 1.2. Applicability
IRIs are designed to allow protocols and software that deal with URIs IRIs are designed to allow protocols and software that deal with URIs
to be updated to handle IRIs. Processing of IRIs is accomplished by to be updated to handle IRIs. Processing of IRIs is accomplished by
extending the URI syntax while retaining (and not expanding) the set extending the URI syntax while retaining (and not expanding) the set
of "reserved" characters, such that the syntax for any URI scheme may of "reserved" characters, such that the syntax for any URI scheme may
be extended to allow non-ASCII characters. In addition, following be extended to allow non-ASCII characters. In addition, following
parsing of an IRI, it is possible to construct a corresponding URI by parsing of an IRI, it is possible to construct a corresponding URI by
first encoding characters outside of the allowed URI range and then first encoding characters outside of the allowed URI range and then
skipping to change at page 7, line 6 skipping to change at page 7, line 10
c. The URI scheme definition, if it explicitly allows a percent sign c. The URI scheme definition, if it explicitly allows a percent sign
("%") in any syntactic component, SHOULD define the interpretation ("%") in any syntactic component, SHOULD define the interpretation
of sequences of percent-encoded octets (using "%XX" hex octets) as of sequences of percent-encoded octets (using "%XX" hex octets) as
octet from sequences of UTF-8 encoded strings; this is recommended octet from sequences of UTF-8 encoded strings; this is recommended
in the guidelines for registering new schemes, [RFC4395bis]. For in the guidelines for registering new schemes, [RFC4395bis]. For
example, this is the practice for IMAP URLs [RFC2192], POP URLs example, this is the practice for IMAP URLs [RFC2192], POP URLs
[RFC2384] and the URN syntax [RFC2141]. Note that use of [RFC2384] and the URN syntax [RFC2141]. Note that use of
percent-encoding may also be restricted in some situations, for percent-encoding may also be restricted in some situations, for
example, URI schemes that disallow percent-encoding might still be example, URI schemes that disallow percent-encoding might still be
used with a fragment identifier which is percent-encoded (e.g., used with a fragment identifier which is percent-encoded (e.g.,
[XPointer]). See Section 4.4 for further discussion. [XPointer]). See Section 5.4 for further discussion.
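As an illustration of item c, the following minimal Python sketch treats "%XX" sequences as octets of a UTF-8 encoded string. The function names percent_decode_utf8 and percent_encode_utf8 are hypothetical and do not come from the draft. Strict error handling is an assumption made here so that octet sequences that are not valid UTF-8 are reported rather than silently replaced.

```python
from urllib.parse import quote, unquote


def percent_decode_utf8(component: str) -> str:
    """Decode %XX octets and interpret the resulting octets as UTF-8.

    Hypothetical helper: raises UnicodeDecodeError when the octets
    do not form valid UTF-8 (errors="strict" is our choice).
    """
    return unquote(component, encoding="utf-8", errors="strict")


def percent_encode_utf8(text: str) -> str:
    """Percent-encode every character outside the unreserved set.

    safe="" is an assumption for this sketch; real components keep
    their scheme-specific delimiters unencoded.
    """
    return quote(text, safe="", encoding="utf-8")


if __name__ == "__main__":
    print(percent_decode_utf8("r%C3%A9sum%C3%A9"))
    print(percent_encode_utf8("\u00e9"))
```

A lone "%E9" is a legal percent-encoded octet but not valid UTF-8, so the decoder above rejects it; a scheme that stores such octets would need a different interpretation.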
1.3. Definitions 1.3. Definitions
The following definitions are used in this document; they follow the The following definitions are used in this document; they follow the
terms in [RFC2130], [RFC2277], and [ISO10646]. terms in [RFC2130], [RFC2277], and [ISO10646].
character: A member of a set of elements used for the organization, character: A member of a set of elements used for the organization,
control, or representation of data. For example, "LATIN CAPITAL control, or representation of data. For example, "LATIN CAPITAL
LETTER A" names a character. LETTER A" names a character.
skipping to change at page 7, line 49 skipping to change at page 8, line 8
Resource Identifier. An IRI reference may be absolute or Resource Identifier. An IRI reference may be absolute or
relative. However, the "IRI" that results from such a reference relative. However, the "IRI" that results from such a reference
only includes absolute IRIs; any relative IRI references are only includes absolute IRIs; any relative IRI references are
resolved to their absolute form. Note that in [RFC2396] URIs did resolved to their absolute form. Note that in [RFC2396] URIs did
not include fragment identifiers, but in [RFC3986] fragment not include fragment identifiers, but in [RFC3986] fragment
identifiers are part of URIs. identifiers are part of URIs.
LEIRI (Legacy Extended IRI) processing: This term was used in LEIRI (Legacy Extended IRI) processing: This term was used in
various XML specifications to refer to strings that, although not various XML specifications to refer to strings that, although not
valid IRIs, were acceptable input to the processing rules in valid IRIs, were acceptable input to the processing rules in
Section 5.1. Section 6.2.
running text: Human text (paragraphs, sentences, phrases) with running text: Human text (paragraphs, sentences, phrases) with
syntax according to orthographic conventions of a natural syntax according to orthographic conventions of a natural
language, as opposed to syntax defined for ease of processing by language, as opposed to syntax defined for ease of processing by
machines (e.g., markup, programming languages). machines (e.g., markup, programming languages).
protocol element: Any portion of a message that affects processing protocol element: Any portion of a message that affects processing
of that message by the protocol in question. of that message by the protocol in question.
presentation element: A presentation form corresponding to a
protocol element; for example, using a wider range of characters.
create (a URI or IRI): With respect to URIs and IRIs, the term is create (a URI or IRI): With respect to URIs and IRIs, the term is
used for the initial creation. This may be the initial creation used for the initial creation. This may be the initial creation
of a resource with a certain identifier, or the initial exposition of a resource with a certain identifier, or the initial exposition
of a resource under a particular identifier. of a resource under a particular identifier.
generate (a URI or IRI): With respect to URIs and IRIs, the term is generate (a URI or IRI): With respect to URIs and IRIs, the term is
used when the identifier is generated by derivation from other used when the identifier is generated by derivation from other
information. information.
parsed URI component: When a URI processor parses a URI (following parsed URI component: When a URI processor parses a URI (following
skipping to change at page 9, line 41 skipping to change at page 9, line 44
the containing protocol or document ensures that the characters in the containing protocol or document ensures that the characters in
the IRI can be handled (e.g., searched, converted, displayed) in the the IRI can be handled (e.g., searched, converted, displayed) in the
same way as the rest of the protocol or document. same way as the rest of the protocol or document.
2.1. Summary of IRI Syntax 2.1. Summary of IRI Syntax
The IRI syntax extends the URI syntax in [RFC3986] by extending the The IRI syntax extends the URI syntax in [RFC3986] by extending the
class of unreserved characters, primarily by adding the characters of class of unreserved characters, primarily by adding the characters of
the UCS (Universal Character Set, [ISO10646]) beyond U+007F, subject the UCS (Universal Character Set, [ISO10646]) beyond U+007F, subject
to the limitations given in the syntax rules below and in to the limitations given in the syntax rules below and in
Section 4.1. Section 5.1.
The syntax and use of components and reserved characters is the same The syntax and use of components and reserved characters is the same
as that in [RFC3986]. Each "URI scheme" thus also functions as an as that in [RFC3986]. Each "URI scheme" thus also functions as an
"IRI scheme", in that scheme-specific parsing rules for URIs of a "IRI scheme", in that scheme-specific parsing rules for URIs of a
scheme are extended to allow parsing of IRIs using the same scheme are extended to allow parsing of IRIs using the same
parsing rules. parsing rules.
All the operations defined in [RFC3986], such as the resolution of All the operations defined in [RFC3986], such as the resolution of
relative references, can be applied to IRIs by IRI-processing relative references, can be applied to IRIs by IRI-processing
software in exactly the same way as they are for URIs by URI- software in exactly the same way as they are for URIs by URI-
skipping to change at page 13, line 25 skipping to change at page 13, line 29
steps are scheme specific. steps are scheme specific.
3.1. Converting to UCS 3.1. Converting to UCS
Input that is already in a Unicode form (i.e., a sequence of Unicode Input that is already in a Unicode form (i.e., a sequence of Unicode
characters or an octet-stream representing a Unicode-based character characters or an octet-stream representing a Unicode-based character
encoding such as UTF-8 or UTF-16) should be left as is and not encoding such as UTF-8 or UTF-16) should be left as is and not
normalized or changed. normalized or changed.
An IRI or IRI reference is a sequence of characters from the UCS. An IRI or IRI reference is a sequence of characters from the UCS.
For resource identifiers that are not already in a Unicode form (as For input from presentations (written on paper, read aloud) or
when written on paper, read aloud, or represented in a text stream translation from other representations (a text stream using a legacy
using a legacy character encoding), convert the IRI to Unicode. Note character encoding), convert the input to Unicode. Note that some
that some character encodings or transcriptions can be converted to character encodings or transcriptions can be converted to or
or represented by more than one sequence of Unicode characters. represented by more than one sequence of Unicode characters. Ideally
Ideally the resulting IRI would use a normalized form, such as the resulting IRI would use a normalized form, such as Unicode
Unicode Normalization Form C [UTR15], since that ensures a stable, Normalization Form C [UTR15], since that ensures a stable, consistent
consistent representation that is most likely to produce the intended representation that is most likely to produce the intended results.
results. Implementers and users are cautioned that, while Previous versions of this specification required normalization at
denormalized character sequences are valid, they might be difficult this step. However, attempts to require normalization in other
for other users or processes to reproduce and might lead to protocols have met with strong enough resistance that requiring
unexpected results. normalization here was considered impractical. Implementers and
users are cautioned that, while denormalized character sequences are
In other cases (written on paper, read aloud, or otherwise valid, they might be difficult for other users or processes to
represented independent of any character encoding) represent the IRI reproduce and might lead to unexpected results.
as a sequence of characters from the UCS normalized according to
Unicode Normalization Form C (NFC, [UTR15]).
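The conversion described in this section can be illustrated with a minimal Python sketch. It is not part of either draft revision; the function names legacy_to_unicode and to_nfc are hypothetical, and the sketch assumes that the legacy character encoding of the input is already known to the caller. Normalization is shown as an optional, separate step, in line with the newer text, which recommends NFC but does not require it.

```python
import unicodedata


def legacy_to_unicode(octets: bytes, encoding: str) -> str:
    """Convert input in a known legacy character encoding to Unicode.

    The encoding name is an assumption supplied by the caller; this
    sketch makes no attempt to guess it.
    """
    return octets.decode(encoding)


def to_nfc(text: str) -> str:
    """Return the NFC form of text (recommended, not required)."""
    return unicodedata.normalize("NFC", text)


# "e" followed by U+0301 COMBINING ACUTE ACCENT composes to U+00E9 under NFC.
decomposed = "re\u0301sume\u0301"
print(to_nfc(decomposed) == "r\u00e9sum\u00e9")

# Octets in iso-8859-1 are converted to the same Unicode characters.
print(legacy_to_unicode(b"r\xe9sum\xe9", "iso-8859-1") == "r\u00e9sum\u00e9")
```

Input that is already in a Unicode form would bypass both functions, since such input is to be left as is.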
3.2. Parse the IRI into IRI components 3.2. Parse the IRI into IRI components
Parse the IRI, either as a relative reference (no scheme) or using Parse the IRI, either as a relative reference (no scheme) or using
scheme specific processing (according to the scheme given); the scheme specific processing (according to the scheme given); the
result is a set of parsed IRI components. result is a set of parsed IRI components.
3.3. General percent-encoding of IRI components 3.3. General percent-encoding of IRI components
Except as noted in the following subsections, IRI components are Except as noted in the following subsections, IRI components are
skipping to change at page 16, line 25 skipping to change at page 16, line 25
than UTF-8 as the binary representation before pct-encoding. This than UTF-8 as the binary representation before pct-encoding. This
mapping is not applied for any other scheme or component. mapping is not applied for any other scheme or component.
3.6. Mapping IRIs to URIs 3.6. Mapping IRIs to URIs
The mapping from an IRI to URI is accomplished by applying the The mapping from an IRI to URI is accomplished by applying the
mapping above (from IRI to URI components) and then reassembling a mapping above (from IRI to URI components) and then reassembling a
URI from the parsed URI components using the original punctuation URI from the parsed URI components using the original punctuation
that delimited the IRI components. that delimited the IRI components.
3.7. Converting URIs to IRIs 4. Converting URIs to IRIs
In some situations, for presentation and further processing, it is In some situations, for presentation and further processing, it is
desirable to convert a URI into an equivalent IRI without unnecessary desirable to convert a URI into an equivalent IRI without unnecessary
percent encoding. Of course, every URI is already an IRI in its own percent encoding. Of course, every URI is already an IRI in its own
right without any conversion. This section gives one possible right without any conversion. This section gives one possible
procedure for URI to IRI mapping. procedure for URI to IRI mapping.
The conversion described in this section, if given a valid URI, will The conversion described in this section, if given a valid URI, will
result in an IRI that maps back to the URI used as an input for the result in an IRI that maps back to the URI used as an input for the
conversion (except for potential case differences in percent-encoding conversion (except for potential case differences in percent-encoding
skipping to change at page 17, line 8 skipping to change at page 17, line 9
2. Some percent-encodings cannot be interpreted as sequences of UTF-8 2. Some percent-encodings cannot be interpreted as sequences of UTF-8
octets. octets.
(Note: The octet patterns of UTF-8 are highly regular. Therefore, (Note: The octet patterns of UTF-8 are highly regular. Therefore,
there is a very high probability, but no guarantee, that percent- there is a very high probability, but no guarantee, that percent-
encodings that can be interpreted as sequences of UTF-8 octets encodings that can be interpreted as sequences of UTF-8 octets
actually originated from UTF-8. For a detailed discussion, see actually originated from UTF-8. For a detailed discussion, see
[Duerst97].) [Duerst97].)
3. The conversion may result in a character that is not appropriate 3. The conversion may result in a character that is not appropriate
in an IRI. See Section 2.2, and Section 4.1 for further details. in an IRI. See Section 2.2, and Section 5.1 for further details.
4. IRI to URI conversion has different rules for dealing with domain 4. IRI to URI conversion has different rules for dealing with domain
names and query parameters. names and query parameters.
Conversion from a URI to an IRI MAY be done by using the following Conversion from a URI to an IRI MAY be done by using the following
steps: steps:
1. Represent the URI as a sequence of octets in US-ASCII. 1. Represent the URI as a sequence of octets in US-ASCII.
2. Convert all percent-encodings ("%" followed by two hexadecimal 2. Convert all percent-encodings ("%" followed by two hexadecimal
digits) to the corresponding octets, except those corresponding to digits) to the corresponding octets, except those corresponding to
"%", characters in "reserved", and characters in US-ASCII not "%", characters in "reserved", and characters in US-ASCII not
allowed in URIs. allowed in URIs.
3. Re-percent-encode any octet produced in step 2 that is not part of 3. Re-percent-encode any octet produced in step 2 that is not part of
a strictly legal UTF-8 octet sequence. a strictly legal UTF-8 octet sequence.
4. Re-percent-encode all octets produced in step 3 that in UTF-8 4. Re-percent-encode all octets produced in step 3 that in UTF-8
represent characters that are not appropriate according to represent characters that are not appropriate according to
Section 2.2 and Section 4.1. Section 2.2 and Section 5.1.
5. Interpret the resulting octet sequence as a sequence of characters 5. Interpret the resulting octet sequence as a sequence of characters
encoded in UTF-8. encoded in UTF-8.
6. URIs known to contain domain names in the reg-name component 6. URIs known to contain domain names in the reg-name component
SHOULD convert punycode-encoded domain name labels to the SHOULD convert punycode-encoded domain name labels to the
corresponding characters using the ToUnicode procedure. corresponding characters using the ToUnicode procedure.
This procedure will convert as many percent-encoded characters as This procedure will convert as many percent-encoded characters as
possible to characters in an IRI. Because there are some choices possible to characters in an IRI. Because there are some choices
when step 4 is applied (see Section 4.1), results may vary. when step 4 is applied (see Section 5.1), results may vary.
Conversions from URIs to IRIs MUST NOT use any character encoding Conversions from URIs to IRIs MUST NOT use any character encoding
other than UTF-8 in steps 3 and 4, even if it might be possible to other than UTF-8 in steps 3 and 4, even if it might be possible to
guess from the context that another character encoding than UTF-8 was guess from the context that another character encoding than UTF-8 was
used in the URI. For example, the URI used in the URI. For example, the URI
"http://www.example.org/r%E9sum%E9.html" might with some guessing be "http://www.example.org/r%E9sum%E9.html" might with some guessing be
interpreted to contain two e-acute characters encoded as iso-8859-1. interpreted to contain two e-acute characters encoded as iso-8859-1.
It must not be converted to an IRI containing these e-acute It must not be converted to an IRI containing these e-acute
characters. Otherwise, in the future the IRI will be mapped to characters. Otherwise, in the future the IRI will be mapped to
"http://www.example.org/r%C3%A9sum%C3%A9.html", which is a different "http://www.example.org/r%C3%A9sum%C3%A9.html", which is a different
URI from "http://www.example.org/r%E9sum%E9.html". URI from "http://www.example.org/r%E9sum%E9.html".
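Steps 1 through 5 above can be sketched in Python as follows. This is an illustration under stated assumptions rather than a reference implementation: the name uri_to_iri is hypothetical, step 6 (ToUnicode for domain name labels) is omitted, and the set NOT_APPROPRIATE used for step 4 contains only the bidi formatting characters as a stand-in for the full set of characters discussed elsewhere in this document.

```python
import re

# "unreserved" characters of RFC 3986; their percent-encodings may be decoded.
UNRESERVED = set(
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~"
)
# Assumption: illustrative subset of characters not appropriate in IRIs.
NOT_APPROPRIATE = {0x200E, 0x200F, *range(0x202A, 0x202F)}

PCT = re.compile(rb"%([0-9A-Fa-f]{2})")


def _pct(octets: bytes) -> str:
    """Percent-encode every octet, using uppercase hexadecimal digits."""
    return "".join("%%%02X" % o for o in octets)


def _decode_step2(match):
    octet = int(match.group(1), 16)
    # Keep "%", reserved characters, and US-ASCII not allowed in URIs encoded.
    if octet >= 0x80 or octet in UNRESERVED:
        return bytes([octet])
    return match.group(0)


def uri_to_iri(uri: str) -> str:
    raw = PCT.sub(_decode_step2, uri.encode("ascii"))  # steps 1 and 2
    out = []
    i = 0
    while i < len(raw):
        b = raw[i]
        if b < 0x80:
            out.append(chr(b))
            i += 1
            continue
        # Steps 3 to 5: accept only strictly legal UTF-8 sequences.
        for n in (2, 3, 4):
            chunk = raw[i:i + n]
            try:
                ch = chunk.decode("utf-8")  # Python's decoder is strict
            except UnicodeDecodeError:
                continue
            # Step 4: re-encode characters that are not appropriate.
            out.append(_pct(chunk) if ord(ch) in NOT_APPROPRIATE else ch)
            i += n
            break
        else:
            out.append(_pct(raw[i:i + 1]))  # step 3: not legal UTF-8
            i += 1
    return "".join(out)


print(uri_to_iri("http://www.example.org/D%C3%BCrst"))
print(uri_to_iri("http://www.example.org/r%E9sum%E9.html"))
```

The second call leaves "%E9" untouched, because the octet E9 followed by US-ASCII letters is not a legal UTF-8 sequence; no guess at iso-8859-1 is made. Lowercase hexadecimal digits in retained percent-encodings come out in uppercase, which matches the case differences the text allows for.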
3.7.1. Examples 4.1. Examples
This section shows various examples of converting URIs to IRIs. Each This section shows various examples of converting URIs to IRIs. Each
example shows the result after each of the steps 1 through 6 is example shows the result after each of the steps 1 through 6 is
applied. XML Notation is used for the final result. Octets are applied. XML Notation is used for the final result. Octets are
denoted by "<" followed by two hexadecimal digits followed by ">". denoted by "<" followed by two hexadecimal digits followed by ">".
The following example contains the sequence "%C3%BC", which is a The following example contains the sequence "%C3%BC", which is a
strictly legal UTF-8 sequence, and which is converted into the actual strictly legal UTF-8 sequence, and which is converted into the actual
character U+00FC, LATIN SMALL LETTER U WITH DIAERESIS (also known as character U+00FC, LATIN SMALL LETTER U WITH DIAERESIS (also known as
u-umlaut). u-umlaut).
skipping to change at page 19, line 26 skipping to change at page 19, line 26
4. http://xn--99zt52a.example.org/%E2%80%AE 4. http://xn--99zt52a.example.org/%E2%80%AE
5. http://xn--99zt52a.example.org/%E2%80%AE 5. http://xn--99zt52a.example.org/%E2%80%AE
6. http://&#x7D0D;&#x8C46;.example.org/%E2%80%AE 6. http://&#x7D0D;&#x8C46;.example.org/%E2%80%AE
Note that the label "xn--99zt52a" is converted to U+7D0D U+8C46 Note that the label "xn--99zt52a" is converted to U+7D0D U+8C46
(Japanese Natto). ((EDITOR NOTE: There is some inconsistency in this (Japanese Natto). ((EDITOR NOTE: There is some inconsistency in this
note.)) note.))
4. Use of IRIs 5. Use of IRIs
4.1. Limitations on UCS Characters Allowed in IRIs 5.1. Limitations on UCS Characters Allowed in IRIs
This section discusses limitations on characters and character This section discusses limitations on characters and character
sequences usable for IRIs beyond those given in Section 2.2. The sequences usable for IRIs beyond those given in Section 2.2. The
considerations in this section are relevant when IRIs are created and considerations in this section are relevant when IRIs are created and
when URIs are converted to IRIs. when URIs are converted to IRIs.
a. The repertoire of characters allowed in each IRI component is a. The repertoire of characters allowed in each IRI component is
limited by the definition of that component. For example, the limited by the definition of that component. For example, the
definition of the scheme component does not allow characters definition of the scheme component does not allow characters
beyond US-ASCII. beyond US-ASCII.
skipping to change at page 20, line 10 skipping to change at page 20, line 10
the full-width equivalents of Latin characters, half-width the full-width equivalents of Latin characters, half-width
Katakana characters for Japanese, and many others. It also Katakana characters for Japanese, and many others. It also
includes many look-alikes of "space", "delims", and "unwise", includes many look-alikes of "space", "delims", and "unwise",
characters excluded in [RFC3491]. characters excluded in [RFC3491].
Additional information is available from [UNIXML]. [UNIXML] is Additional information is available from [UNIXML]. [UNIXML] is
written in the context of running text rather than in that of written in the context of running text rather than in that of
identifiers. Nevertheless, it discusses many of the categories of identifiers. Nevertheless, it discusses many of the categories of
characters not appropriate for IRIs. characters not appropriate for IRIs.
4.2. Software Interfaces and Protocols 5.2. Software Interfaces and Protocols
Although an IRI is defined as a sequence of characters, software Although an IRI is defined as a sequence of characters, software
interfaces for URIs typically function on sequences of octets or interfaces for URIs typically function on sequences of octets or
other kinds of code units. Thus, software interfaces and protocols other kinds of code units. Thus, software interfaces and protocols
MUST define which character encoding is used. MUST define which character encoding is used.
Intermediate software interfaces between IRI-capable components and Intermediate software interfaces between IRI-capable components and
URI-only components MUST map the IRIs per Section 3.6, when URI-only components MUST map the IRIs per Section 3.6, when
transferring from IRI-capable to URI-only components. This mapping transferring from IRI-capable to URI-only components. This mapping
SHOULD be applied as late as possible. It SHOULD NOT be applied SHOULD be applied as late as possible. It SHOULD NOT be applied
between components that are known to be able to handle IRIs. between components that are known to be able to handle IRIs.
4.3. Format of URIs and IRIs in Documents and Protocols 5.3. Format of URIs and IRIs in Documents and Protocols
Document formats that transport URIs may have to be upgraded to allow Document formats that transport URIs may have to be upgraded to allow
the transport of IRIs. In cases where the document as a whole has a the transport of IRIs. In cases where the document as a whole has a
native character encoding, IRIs MUST also be encoded in this native character encoding, IRIs MUST also be encoded in this
character encoding and converted accordingly by a parser or character encoding and converted accordingly by a parser or
interpreter. IRI characters not expressible in the native character interpreter. IRI characters not expressible in the native character
encoding SHOULD be escaped by using the escaping conventions of the encoding SHOULD be escaped by using the escaping conventions of the
document format if such conventions are available. Alternatively, document format if such conventions are available. Alternatively,
they MAY be percent-encoded according to Section 3.6. For example, they MAY be percent-encoded according to Section 3.6. For example,
in HTML or XML, numeric character references SHOULD be used. If a in HTML or XML, numeric character references SHOULD be used. If a
skipping to change at page 20, line 46 skipping to change at page 20, line 46
the document in the UTF-8 character encoding. the document in the UTF-8 character encoding.
((UPDATE THIS NOTE)) Note: Some formats already accommodate IRIs, ((UPDATE THIS NOTE)) Note: Some formats already accommodate IRIs,
although they use different terminology. HTML 4.0 [HTML4] defines although they use different terminology. HTML 4.0 [HTML4] defines
the conversion from IRIs to URIs as error-avoiding behavior. XML 1.0 the conversion from IRIs to URIs as error-avoiding behavior. XML 1.0
[XML1], XLink [XLink], XML Schema [XMLSchema], and specifications [XML1], XLink [XLink], XML Schema [XMLSchema], and specifications
based upon them allow IRIs. Also, it is expected that all relevant based upon them allow IRIs. Also, it is expected that all relevant
new W3C formats and protocols will be required to handle IRIs new W3C formats and protocols will be required to handle IRIs
[CharMod]. [CharMod].
4.4. Use of UTF-8 for Encoding Original Characters 5.4. Use of UTF-8 for Encoding Original Characters
This section discusses details and gives examples for point c) in This section discusses details and gives examples for point c) in
Section 1.2. To be able to use IRIs, the URI corresponding to the Section 1.2. To be able to use IRIs, the URI corresponding to the
IRI in question has to encode original characters into octets by IRI in question has to encode original characters into octets by
using UTF-8. This can be specified for all URIs of a URI scheme or using UTF-8. This can be specified for all URIs of a URI scheme or
can apply to individual URIs for schemes that do not specify how to can apply to individual URIs for schemes that do not specify how to
encode original characters. It can apply to the whole URI, or only encode original characters. It can apply to the whole URI, or only
to some part. For background information on encoding characters into to some part. For background information on encoding characters into
URIs, see also Section 2.5 of [RFC3986]. URIs, see also Section 2.5 of [RFC3986].
skipping to change at page 22, line 27 skipping to change at page 22, line 27
document name is encoded in iso-8859-1 based on server settings, but document name is encoded in iso-8859-1 based on server settings, but
where the fragment identifier is encoded in UTF-8 according to where the fragment identifier is encoded in UTF-8 according to
[XPointer]. The IRI corresponding to the above URI would be (in XML [XPointer]. The IRI corresponding to the above URI would be (in XML
notation) notation)
"http://www.example.org/r%E9sum%E9.xml#r&#xE9;sum&#xE9;". "http://www.example.org/r%E9sum%E9.xml#r&#xE9;sum&#xE9;".
Similar considerations apply to query parts. The functionality of Similar considerations apply to query parts. The functionality of
IRIs (namely, to be able to include non-ASCII characters) can only be IRIs (namely, to be able to include non-ASCII characters) can only be
used if the query part is encoded in UTF-8. used if the query part is encoded in UTF-8.
4.5. Relative IRI References 5.5. Relative IRI References
Processing of relative IRI references against a base is handled Processing of relative IRI references against a base is handled
straightforwardly; the algorithms of [RFC3986] can be applied straightforwardly; the algorithms of [RFC3986] can be applied
directly, treating the characters additionally allowed in IRI directly, treating the characters additionally allowed in IRI
references in the same way that unreserved characters are in URI references in the same way that unreserved characters are in URI
references. references.
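As an illustration only, the following Python sketch resolves a relative IRI reference against a base using the standard library function urllib.parse.urljoin, which operates on Unicode strings and leaves non-ASCII characters untouched. The example IRIs are made up, and urljoin is merely one existing implementation of reference resolution; its behavior in unusual edge cases is not claimed here to match [RFC3986] exactly.

```python
from urllib.parse import urljoin

base = "http://example.org/a/b"
reference = "../r\u00e9sum\u00e9.html"  # relative IRI reference with e-acute

# The non-ASCII characters are carried through like unreserved characters.
resolved = urljoin(base, reference)
print(resolved == "http://example.org/r\u00e9sum\u00e9.html")
```

No percent-encoding takes place during resolution; mapping to a URI remains a separate, later step.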
5. Liberal Handling of Otherwise Invalid IRIs 6. Legacy Extended IRIs (LEIRIs)
Some technical specifications and widely-deployed software have For historic reasons, some formats have allowed variants of IRIs that
allowed additional variations and extensions of IRIs to be used in are somewhat less restricted in syntax. This section provides a
syntactic components. definition and a name (Legacy Extended IRI or LEIRI) for these
variants for easier reference. These variants have to be used with
care; they require further processing before being fully
interchangeable as IRIs. New protocols and formats SHOULD NOT use
Legacy Extended IRIs. Even where Legacy Extended IRIs are allowed,
only IRIs fully conforming to the syntax definition in Section 2.2
SHOULD be created, generated, and used. The provisions in this
section also apply to Legacy Extended IRI references.
Future technical specifications SHOULD NOT allow conforming producers 6.1. Legacy Extended IRI Syntax
to produce, or conforming content to contain, such forms, as they are
not interoperable with other IRI consuming software.
5.1. LEIRI Processing The syntax of Legacy Extended IRIs is the same as that for IRIs,
except that ucschar is redefined as follows:
This section defines Legacy Extended IRIs (LEIRIs). The syntax of ucschar = " " / "<" / ">" / '"' / "{" / "}" / "|"
Legacy Extended IRIs is the same as that for <IRI-reference>, except / "\" / "^" / "`" / %x0-1F / %x7F-D7FF
that the ucschar production is replaced by the leiri-ucschar / %xE000-FFFD / %x10000-10FFFF
production:
leiri-ucschar = " " / "<" / ">" / '"' / "{" / "}" / "|" The restriction on bidirectional formatting characters in [Bidi] is
/ "\" / "^" / "`" / %x0-1F / %x7F-D7FF lifted. The iprivate production becomes redundant.
/ %xE000-FFFD / %x10000-10FFFF
Among other extensions, processors based on this specification also Likewise, the syntax for Legacy Extended IRI references (LEIRI
did not enforce the restriction on bidirectional formatting references) is the same as that for IRI references with the above
characters in [Bidi], and the iprivate production becomes redundant. redefinition of ucschar applied.
To convert a string allowed as a LEIRI to an IRI, each character Formats that use Legacy Extended IRIs or Legacy Extended IRI
allowed in leiri-ucschar but not in ucschar must be percent-encoded references MAY further restrict the characters allowed therein,
using Section 3.3. either implicitly by the fact that the format as such does not allow
some characters, or explicitly. An example of a character not
allowed implicitly may be the NUL character (U+0000). However, all
the characters allowed in IRIs MUST still be allowed.
6. Characters Disallowed or Not Recommended in IRIs 6.2. Conversion of Legacy Extended IRIs to IRIs
To convert a Legacy Extended IRI (reference) to an IRI (reference),
each character allowed in a Legacy Extended IRI (reference) but not
allowed in an IRI (reference) (see Section 6.3) MUST be percent-
encoded by applying steps 2.1 to 2.3 of Section 3.6.
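The conversion can be sketched in Python as follows. The function names are hypothetical, and the character groups are taken from the list of groups given in this document (space, delimiters, unwise characters, controls, bidi formatting characters, specials, private use code points, tags, and non-characters). Because private use code points are allowed in IRIs only in the query part, the sketch takes an in_query flag; splitting the input into components is assumed to happen elsewhere.

```python
def _leiri_only(cp: int, in_query: bool = False) -> bool:
    """True if code point cp is allowed in a LEIRI but not in an IRI."""
    if cp in (0x20, 0x3C, 0x3E, 0x22):              # space, "<", ">", '"'
        return True
    if cp in (0x5C, 0x5E, 0x60, 0x7B, 0x7C, 0x7D):  # unwise characters
        return True
    if cp <= 0x1F or 0x7F <= cp <= 0x9F:            # C0, DEL, C1 controls
        return True
    if cp in (0x200E, 0x200F) or 0x202A <= cp <= 0x202E:  # bidi formatting
        return True
    if 0xFFF0 <= cp <= 0xFFFD:                      # specials
        return True
    if 0xE0000 <= cp <= 0xE0FFF:                    # tags
        return True
    if 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE:  # non-characters
        return True
    private = (0xE000 <= cp <= 0xF8FF or 0xF0000 <= cp <= 0xFFFFD
               or 0x100000 <= cp <= 0x10FFFD)
    return private and not in_query                 # private use: query only


def leiri_to_iri(text: str, in_query: bool = False) -> str:
    """Percent-encode (via UTF-8) each character that is LEIRI-only."""
    out = []
    for ch in text:
        if _leiri_only(ord(ch), in_query):
            out.append("".join("%%%02X" % b for b in ch.encode("utf-8")))
        else:
            out.append(ch)
    return "".join(out)


print(leiri_to_iri("http://example.org/a b|c"))
```

Characters that are already allowed in IRIs, such as U+00E9, pass through unchanged; only the additional characters are percent-encoded.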
6.3. Characters Allowed in Legacy Extended IRIs but not in IRIs
This section provides a list of the groups of characters and code This section provides a list of the groups of characters and code
points that are allowed in some contexts but are not allowed in IRIs points that are allowed in Legacy Extended IRIs, but are not allowed
or are allowed in IRIs only in the query part. For each group of in IRIs or are allowed in IRIs only in the query part. For each
characters, advice on the usage of these characters is also given, group of characters, advice on the usage of these characters is also
concentrating on the reasons for why they are excluded from IRI use. given, concentrating on the reasons for why not to use them.
Space (U+0020): Some formats and applications use space as a Space (U+0020): Some formats and applications use space as a
delimiter, e.g. for items in a list. Appendix C of [RFC3986] also delimiter, e.g. for items in a list. Appendix C of [RFC3986] also
mentions that white space may have to be added when displaying or mentions that white space may have to be added when displaying or
printing long URIs; the same applies to long IRIs. This means printing long URIs; the same applies to long IRIs. This means
that spaces can disappear, or can cause what is intended as a that spaces can disappear, or can cause the Legacy Extended IRI to
single IRI or IRI reference to be treated as two or more separate be interpreted as two or more separate IRIs.
IRIs.
Delimiters "<" (U+003C), ">" (U+003E), and '"' (U+0022): Appendix Delimiters "<" (U+003C), ">" (U+003E), and '"' (U+0022): Appendix
C of [RFC3986] suggests the use of double-quotes C of [RFC3986] suggests the use of double-quotes
("http://example.com/") and angle brackets (<http://example.com/>) ("http://example.com/") and angle brackets (<http://example.com/>)
as delimiters for URIs in plain text. These conventions are often as delimiters for URIs in plain text. These conventions are often
used, and also apply to IRIs. Using these characters in strings used, and also apply to IRIs. Legacy Extended IRIs using these
intended to be IRIs would result in the IRIs being cut off at the characters will be cut off at the wrong place.
wrong place.
Unwise characters "\" (U+005C), "^" (U+005E), "`" (U+0060), "{" Unwise characters "\" (U+005C), "^" (U+005E), "`" (U+0060), "{"
(U+007B), "|" (U+007C), and "}" (U+007D): These characters (U+007B), "|" (U+007C), and "}" (U+007D): These characters
originally have been excluded from URIs because the respective originally have been excluded from URIs because the respective
codepoints are assigned to different graphic characters in some codepoints are assigned to different graphic characters in some
7-bit or 8-bit encoding. Despite the move to Unicode, some of 7-bit or 8-bit encoding. Despite the move to Unicode, some of
these characters are still occasionally displayed differently on these characters are still occasionally displayed differently on
some systems, e.g. U+005C may appear as a Japanese Yen symbol on some systems, e.g. U+005C as a Japanese Yen symbol. Also, the
some systems. Also, the fact that these characters are not used fact that these characters are not used in URIs or IRIs has
in URIs or IRIs has encouraged their use outside URIs or IRIs in encouraged their use outside URIs or IRIs in contexts that may
contexts that may include URIs or IRIs. If a string with such a include URIs or IRIs. In case a Legacy Extended IRI with such a
character were used as an IRI in such a context, it would likely character is used in such a context, the Legacy Extended IRI will
be interpreted piecemeal. be interpreted piecemeal.
The controls (C0 controls, DEL, and C1 controls, #x0 - #x1F #x7F - The controls (C0 controls, DEL, and C1 controls, #x0 - #x1F #x7F -
#x9F): There is generally no way to transmit these characters #x9F): There is no way to transmit these characters reliably
reliably as text outside of a charset encoding. Even when in except potentially in electronic form. Even when in electronic
encoded form, many software components silently filter out some of form, some software components might silently filter out some of
these characters, or may stop processing altogether when these characters, or may stop processing altogether when
encountering some of them. These characters may affect text encountering some of them. These characters may affect text
display in subtle, unnoticeable ways or in drastic, global, and display in subtle, unnoticeable ways or in drastic, global, and
irreversible ways depending on the hardware and software involved. irreversible ways depending on the hardware and software involved.
The use of some of these characters would allow malicious users to The use of some of these characters may allow malicious users to
manipulate the display of an IRI and its context in many manipulate the display of a Legacy Extended IRI and its context.
situations.
Bidi formatting characters (U+200E, U+200F, U+202A-202E): These Bidi formatting characters (U+200E, U+200F, U+202A-202E): These
characters affect the display ordering of characters. If IRIs characters affect the display ordering of characters. Displayed
were allowed to contain these characters and the resulting visual Legacy Extended IRIs containing these characters cannot be
display transcribed, they could not be converted back to converted back to electronic form (logical order) unambiguously.
electronic form (logical order) unambiguously. These characters, These characters may allow malicious users to manipulate the
if allowed in IRIs, might allow malicious users to manipulate the display of a Legacy Extended IRI and its context.
display of an IRI and its context.
Specials (U+FFF0-FFFD): These code points provide functionality Specials (U+FFF0-FFFD): These code points provide functionality
beyond that useful in an IRI, for example byte order beyond that useful in a Legacy Extended IRI, for example byte
identification, annotation, and replacements for unknown order identification, annotation, and replacements for unknown
characters and objects. Their use and interpretation in an IRI characters and objects. Their use and interpretation in a Legacy
would serve no purpose and might lead to confusing display Extended IRI serves no purpose and may lead to confusing display
variations. variations.
Private use code points (U+E000-F8FF, U+F0000-FFFFD, U+100000- Private use code points (U+E000-F8FF, U+F0000-FFFFD, U+100000-
10FFFD): Display and interpretation of these code points is by 10FFFD): Display and interpretation of these code points is by
definition undefined without private agreement. In any case, definition undefined without private agreement. Therefore, these
these code points are not suited for use on the Internet. They code points are not suited for use on the Internet. They are not
are not interoperable and may have unpredictable effects. interoperable and may have unpredictable effects.
Tags (U+E0000-E0FFF): These characters were intended to provide a Tags (U+E0000-E0FFF): These characters provide a way to language
way to language tag in Unicode plain text. They are now tag in Unicode plain text. They are not appropriate for Legacy
deprecated [RFC6082]. In any case, they would not be appropriate Extended IRIs because language information in identifiers cannot
for IRIs because language information in identifiers cannot
reliably be input, transmitted (e.g. on a visual medium such as reliably be input, transmitted (e.g. on a visual medium such as
paper), or recognized. paper), or recognized.
Non-characters (U+FDD0-FDEF, U+1FFFE-1FFFF, U+2FFFE-2FFFF, Non-characters (U+FDD0-FDEF, U+1FFFE-1FFFF, U+2FFFE-2FFFF,
U+3FFFE-3FFFF, U+4FFFE-4FFFF, U+5FFFE-5FFFF, U+6FFFE-6FFFF, U+3FFFE-3FFFF, U+4FFFE-4FFFF, U+5FFFE-5FFFF, U+6FFFE-6FFFF,
U+7FFFE-7FFFF, U+8FFFE-8FFFF, U+9FFFE-9FFFF, U+AFFFE-AFFFF, U+7FFFE-7FFFF, U+8FFFE-8FFFF, U+9FFFE-9FFFF, U+AFFFE-AFFFF,
U+BFFFE-BFFFF, U+CFFFE-CFFFF, U+DFFFE-DFFFF, U+EFFFE-EFFFF, U+BFFFE-BFFFF, U+CFFFE-CFFFF, U+DFFFE-DFFFF, U+EFFFE-EFFFF,
U+FFFFE-FFFFF, U+10FFFE-10FFFF): These code points are defined as U+FFFFE-FFFFF, U+10FFFE-10FFFF): These code points are defined as
non-characters. Applications may use some of them internally, but non-characters. Applications may use some of them internally, but
are not prepared to interchange them. are not prepared to interchange them.
LEIRI preprocessing disallowed some code points and code units: For reference, we here also list the code points and code units not
even allowed in Legacy Extended IRIs:
Surrogate code units (D800-DFFF): These do not represent Unicode Surrogate code units (D800-DFFF): These do not represent Unicode
codepoints. codepoints.
7. URI/IRI Processing Guidelines (Informative) 7. URI/IRI Processing Guidelines (Informative)
This informative section provides guidelines for supporting IRIs in This informative section provides guidelines for supporting IRIs in
the same software components and operations that currently process the same software components and operations that currently process
URIs: Software interfaces that handle URIs, software that allows URIs: Software interfaces that handle URIs, software that allows
users to enter URIs, software that creates or generates URIs, users to enter URIs, software that creates or generates URIs,
skipping to change at page 25, line 37 skipping to change at page 25, line 51
that are designed to carry IRIs. that are designed to carry IRIs.
In case the current handling in an API or protocol is based on US- In case the current handling in an API or protocol is based on US-
ASCII, UTF-8 is recommended as the character encoding for IRIs, as it ASCII, UTF-8 is recommended as the character encoding for IRIs, as it
is compatible with US-ASCII, is in accordance with the is compatible with US-ASCII, is in accordance with the
recommendations of [RFC2277], and makes converting to URIs easy. In recommendations of [RFC2277], and makes converting to URIs easy. In
any case, the API or protocol definition must clearly define the any case, the API or protocol definition must clearly define the
character encoding to be used. character encoding to be used.
The transfer from URI-only to IRI-capable components requires no The transfer from URI-only to IRI-capable components requires no
mapping, although the conversion described in Section 3.7 above may mapping, although the conversion described in Section 4 above may be
be performed. It is preferable not to perform this inverse performed. It is preferable not to perform this inverse conversion
conversion unless it is certain this can be done correctly. unless it is certain this can be done correctly.
7.2. URI/IRI Entry 7.2. URI/IRI Entry
Some components allow users to enter URIs into the system by typing Some components allow users to enter URIs into the system by typing
or dictation, for example. This software must be updated to allow or dictation, for example. This software must be updated to allow
for IRI entry. for IRI entry.
A person viewing a visual representation of an IRI (as a sequence of A person viewing a visual presentation of an IRI (as a sequence of
glyphs, in some order, in some visual display) or hearing an IRI will glyphs, in some order, in some visual display) will use an entry
use an entry method for characters in the user's language to input method for characters in the user's language to input the IRI.
the IRI. Depending on the script and the input method used, this may Depending on the script and the input method used, this may be a more
be a more or less complicated process. or less complicated process.
The process of IRI entry must ensure, as much as possible, that the The process of IRI entry must ensure, as much as possible, that the
restrictions defined in Section 2.2 are met. This may be done by restrictions defined in Section 2.2 are met. This may be done by
choosing appropriate input methods or variants/settings thereof, by choosing appropriate input methods or variants/settings thereof, by
appropriately converting the characters being input, by eliminating appropriately converting the characters being input, by eliminating
characters that cannot be converted, and/or by issuing a warning or characters that cannot be converted, and/or by issuing a warning or
error message to the user. error message to the user.
As an example of variant settings, input method editors for East As an example of variant settings, input method editors for East
Asian Languages usually allow the input of Latin letters and related Asian Languages usually allow the input of Latin letters and related
skipping to change at page 30, line 17 skipping to change at page 30, line 29
encoding for file names will make the transition to IRIs easier. encoding for file names will make the transition to IRIs easier.
Likewise, when a new Web form is set up using UTF-8 as the character Likewise, when a new Web form is set up using UTF-8 as the character
encoding of the form page, the returned query URIs will use UTF-8 as encoding of the form page, the returned query URIs will use UTF-8 as
the character encoding (unless the user, for whatever reason, changes the character encoding (unless the user, for whatever reason, changes
the character encoding) and will therefore be compatible with IRIs. the character encoding) and will therefore be compatible with IRIs.
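As a rough illustration of the Web form case, the sketch below builds a query string the way a UTF-8 form page would return it, using Python's standard urllib.parse.urlencode. The helper name form_query_sketch and the field name "name" are assumptions made for this example and do not come from either draft.

```python
from urllib.parse import urlencode


def form_query_sketch(fields: dict) -> str:
    """Serialize form fields as a query string with UTF-8 percent-encoding."""
    # encoding="utf-8" mirrors a form page whose character encoding is UTF-8.
    return urlencode(fields, encoding="utf-8")


if __name__ == "__main__":
    print(form_query_sketch({"name": "D\u00fcrst"}))
```

The resulting percent-encoded octets are the UTF-8 bytes of the submitted characters, so the query can be mapped back to the corresponding IRI form without guessing the encoding.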
These recommendations, when taken together, will allow for the These recommendations, when taken together, will allow for the
extension from URIs to IRIs in order to handle characters other than extension from URIs to IRIs in order to handle characters other than
US-ASCII while minimizing interoperability problems. For US-ASCII while minimizing interoperability problems. For
considerations regarding the upgrade of URI scheme definitions, see considerations regarding the upgrade of URI scheme definitions, see
Section 4.4. Section 5.4.
8. IANA Considerations 8. IANA Considerations
RFC Editor and IANA note: Please Replace RFC XXXX with the number of RFC Editor and IANA note: Please Replace RFC XXXX with the number of
this document when it issues as an RFC. this document when it issues as an RFC.
IANA maintains a registry of "URI schemes". A "URI scheme" also IANA maintains a registry of "URI schemes". A "URI scheme" also
serves as an "IRI scheme". serves as an "IRI scheme".
To clarify that the URI scheme registration process also applies to To clarify that the URI scheme registration process also applies to
skipping to change at page 31, line 30 skipping to change at page 31, line 41
Confusion can occur in various IRI components, such as the domain Confusion can occur in various IRI components, such as the domain
name part or the path part, or between IRI components. For name part or the path part, or between IRI components. For
considerations specific to the domain name part, see [RFC5890]. For considerations specific to the domain name part, see [RFC5890]. For
considerations specific to particular protocols or schemes, see the considerations specific to particular protocols or schemes, see the
security sections of the relevant specifications and registration security sections of the relevant specifications and registration
templates. Administrators of sites that allow independent users to templates. Administrators of sites that allow independent users to
create resources in the same sub area have to be careful. Details create resources in the same sub area have to be careful. Details
are discussed in Section 7.5. are discussed in Section 7.5.
The characters additionally allowed in Legacy Extended IRIs introduce The characters additionally allowed in Legacy Extended IRIs introduce
additional security issues. For details, see Section 6. additional security issues. For details, see Section 6.3.
10. Acknowledgements 10. Acknowledgements
This document was derived from [RFC3987]; the acknowledgments from This document was derived from [RFC3987]; the acknowledgments from
that specification still apply. that specification still apply.
In addition, this document was influenced by contributions from (in In addition, this document was influenced by contributions from (in
no particular order) Norman Walsh, Richard Tobin, Henry S. Thompson, no particular order) Norman Walsh, Richard Tobin, Henry S. Thompson,
John Cowan, Paul Grosso, the XML Core Working Group of the W3C, Chris John Cowan, Paul Grosso, the XML Core Working Group of the W3C, Chris
Lilley, Bjoern Hoehrmann, Felix Sasaki, Jeremy Carroll, Frank Lilley, Bjoern Hoehrmann, Felix Sasaki, Jeremy Carroll, Frank
skipping to change at page 38, line 31 skipping to change at page 38, line 43
Markup Languages", Unicode Technical Report #20, World Markup Languages", Unicode Technical Report #20, World
Wide Web Consortium Note, June 2003, Wide Web Consortium Note, June 2003,
<http://www.w3.org/TR/unicode-xml/>. <http://www.w3.org/TR/unicode-xml/>.
[UTR36] Davis, M. and M. Suignard, "Unicode Security [UTR36] Davis, M. and M. Suignard, "Unicode Security
Considerations", Unicode Technical Report #36, Considerations", Unicode Technical Report #36,
August 2010, <http://unicode.org/reports/tr36/>. August 2010, <http://unicode.org/reports/tr36/>.
[XLink] DeRose, S., Maler, E., and D. Orchard, "XML Linking [XLink] DeRose, S., Maler, E., and D. Orchard, "XML Linking
Language (XLink) Version 1.0", World Wide Web Language (XLink) Version 1.0", World Wide Web
Consortium Recommendation, June 2001, Consortium REC-xlink-20010627, June 2001,
<http://www.w3.org/TR/xlink/#link-locators>. <http://www.w3.org/TR/xlink/#link-locators>.
[XML1] Bray, T., Paoli, J., Sperberg-McQueen, C., Maler, E., and [XML1] Bray, T., Paoli, J., Sperberg-McQueen, C., Maler, E., and
F. Yergeau, "Extensible Markup Language (XML) 1.0 (Fourth F. Yergeau, "Extensible Markup Language (XML) 1.0 (Fourth
Edition)", World Wide Web Consortium Recommendation, Edition)", World Wide Web Consortium REC-xml-20081126,
August 2006, <http://www.w3.org/TR/REC-xml>. August 2006, <http://www.w3.org/TR/REC-xml>.
[XMLNamespace] [XMLNamespace]
Bray, T., Hollander, D., Layman, A., and R. Tobin, Bray, T., Hollander, D., Layman, A., and R. Tobin,
"Namespaces in XML (Second Edition)", World Wide Web "Namespaces in XML (Second Edition)", World Wide Web
Consortium Recommendation, August 2006, Consortium REC-xml-names-20091208, August 2006,
<http://www.w3.org/TR/REC-xml-names>. <http://www.w3.org/TR/REC-xml-names>.
[XMLSchema] [XMLSchema]
Biron, P. and A. Malhotra, "XML Schema Part 2: Datatypes", Biron, P. and A. Malhotra, "XML Schema Part 2: Datatypes",
World Wide Web Consortium Recommendation, May 2001, World Wide Web Consortium REC-xmlschema-2-20041028,
<http://www.w3.org/TR/xmlschema-2/#anyURI>. May 2001, <http://www.w3.org/TR/xmlschema-2/#anyURI>.
[XPointer] [XPointer]
Grosso, P., Maler, E., Marsh, J., and N. Walsh, "XPointer Grosso, P., Maler, E., Marsh, J., and N. Walsh, "XPointer
Framework", World Wide Web Consortium Recommendation, Framework", World Wide Web Consortium REC-xptr-framework-
March 2003, 20030325, March 2003,
<http://www.w3.org/TR/xptr-framework/#escaping>. <http://www.w3.org/TR/xptr-framework/#escaping>.
Authors' Addresses Authors' Addresses
Martin Duerst (Note: Please write "Duerst" with u-umlaut wherever Martin Duerst (Note: Please write "Duerst" with u-umlaut wherever
possible, for example as "D&#252;rst" in XML and HTML) possible, for example as "D&#252;rst" in XML and HTML.)
Aoyama Gakuin University Aoyama Gakuin University
5-10-1 Fuchinobe 5-10-1 Fuchinobe
Sagamihara, Kanagawa 229-8558 Sagamihara, Kanagawa 229-8558
Japan Japan
Phone: +81 42 759 6329 Phone: +81 42 759 6329
Fax: +81 42 759 6495 Fax: +81 42 759 6495
Email: duerst@it.aoyama.ac.jp Email: duerst@it.aoyama.ac.jp
URI: http://www.sw.it.aoyama.ac.jp/D%C3%BCrst/ URI: http://www.sw.it.aoyama.ac.jp/D%C3%BCrst/
(Note: This is the percent-encoded form of an IRI) (Note: This is the percent-encoded form of an IRI.)
Michel Suignard Michel Suignard
Unicode Consortium Unicode Consortium
P.O. Box 391476 P.O. Box 391476
Mountain View, CA 94039-1476 Mountain View, CA 94039-1476
U.S.A. U.S.A.
Phone: +1-650-693-3921 Phone: +1-650-693-3921
Email: michel@unicode.org Email: michel@unicode.org
URI: http://www.suignard.com URI: http://www.suignard.com
 End of changes. 66 change blocks. 
178 lines changed or deleted 183 lines changed or added

This html diff was produced by rfcdiff 1.41. The latest version is available from http://tools.ietf.org/tools/rfcdiff/