draft-ietf-iri-3987bis-06.txt   draft-ietf-iri-3987bis-07.txt 
Internationalized Resource Identifiers M. Duerst Internationalized Resource Identifiers M. Duerst
(iri) Aoyama Gakuin University (iri) Aoyama Gakuin University
Internet-Draft M. Suignard Internet-Draft M. Suignard
Obsoletes: 3987 (if approved) Unicode Consortium Obsoletes: 3987 (if approved) Unicode Consortium
Intended status: Standards Track L. Masinter Intended status: Standards Track L. Masinter
Expires: February 13, 2012 Adobe Expires: February 15, 2012 Adobe
August 12, 2011 August 14, 2011
Internationalized Resource Identifiers (IRIs) Internationalized Resource Identifiers (IRIs)
draft-ietf-iri-3987bis-06 draft-ietf-iri-3987bis-07
Abstract Abstract
This document defines the Internationalized Resource Identifier (IRI) This document defines the Internationalized Resource Identifier (IRI)
protocol element, as an extension of the Uniform Resource Identifier protocol element, as an extension of the Uniform Resource Identifier
(URI). An IRI is a sequence of characters from the Universal (URI). An IRI is a sequence of characters from the Universal
Character Set (Unicode/ISO 10646). Grammar and processing rules are Character Set (Unicode/ISO 10646). Grammar and processing rules are
given for IRIs and related syntactic forms. given for IRIs and related syntactic forms.
In addition, this document provides named additional rule sets for In addition, this document provides named additional rule sets for
skipping to change at page 1, line 39 skipping to change at page 1, line 39
extending the definition of URI) allows independent orderly extending the definition of URI) allows independent orderly
transitions: other protocols and languages that use URIs must transitions: other protocols and languages that use URIs must
explicitly choose to allow IRIs. explicitly choose to allow IRIs.
Guidelines are provided for the use and deployment of IRIs and Guidelines are provided for the use and deployment of IRIs and
related protocol elements when revising protocols, formats, and related protocol elements when revising protocols, formats, and
software components that currently deal only with URIs. software components that currently deal only with URIs.
RFC Editor: Please remove the next paragraph before publication. RFC Editor: Please remove the next paragraph before publication.
This document is intended to update RFC 3987 and move towards IETF This (and several companion documents) are intended to obsolete RFC
Draft Standard. For discussion and comments on this draft, please 3987, and also move towards IETF Draft Standard. For discussion and
join the IETF IRI WG by subscribing to the mailing list comments on these drafts, please join the IETF IRI WG by subscribing
public-iri@w3.org. For a list of open issues, please see the issue to the mailing list public-iri@w3.org, archives at
tracker of the WG at http://trac.tools.ietf.org/wg/iri/trac/report/1. http://lists.w3.org/archives/public/public-iri/. For a list of open
For a list of individual edits, please see the change history at issues, please see the issue tracker of the WG at
http://trac.tools.ietf.org/wg/iri/trac/report/1. For a list of
individual edits, please see the change history at
http://trac.tools.ietf.org/wg/iri/trac/log/draft-ietf-iri-3987bis. http://trac.tools.ietf.org/wg/iri/trac/log/draft-ietf-iri-3987bis.
Status of this Memo Status of this Memo
This Internet-Draft is submitted in full conformance with the This Internet-Draft is submitted in full conformance with the
provisions of BCP 78 and BCP 79. provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF). Note that other groups may also distribute Task Force (IETF). Note that other groups may also distribute
working documents as Internet-Drafts. The list of current Internet- working documents as Internet-Drafts. The list of current Internet-
Drafts is at http://datatracker.ietf.org/drafts/current/. Drafts is at http://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress." material or to cite them other than as "work in progress."
This Internet-Draft will expire on February 13, 2012. This Internet-Draft will expire on February 15, 2012.
Copyright Notice Copyright Notice
Copyright (c) 2011 IETF Trust and the persons identified as the Copyright (c) 2011 IETF Trust and the persons identified as the
document authors. All rights reserved. document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents Provisions Relating to IETF Documents
(http://trustee.ietf.org/license-info) in effect on the date of (http://trustee.ietf.org/license-info) in effect on the date of
publication of this document. Please review these documents publication of this document. Please review these documents
skipping to change at page 3, line 11 skipping to change at page 3, line 11
not be created outside the IETF Standards Process, except to format not be created outside the IETF Standards Process, except to format
it for publication as an RFC or to translate it into languages other it for publication as an RFC or to translate it into languages other
than English. than English.
Table of Contents Table of Contents
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 5 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.1. Overview and Motivation . . . . . . . . . . . . . . . . . 5 1.1. Overview and Motivation . . . . . . . . . . . . . . . . . 5
1.2. Applicability . . . . . . . . . . . . . . . . . . . . . . 6 1.2. Applicability . . . . . . . . . . . . . . . . . . . . . . 6
1.3. Definitions . . . . . . . . . . . . . . . . . . . . . . . 7 1.3. Definitions . . . . . . . . . . . . . . . . . . . . . . . 7
1.4. Notation . . . . . . . . . . . . . . . . . . . . . . . . 9 1.4. Notation . . . . . . . . . . . . . . . . . . . . . . . . 8
2. IRI Syntax . . . . . . . . . . . . . . . . . . . . . . . . . . 9 2. IRI Syntax . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.1. Summary of IRI Syntax . . . . . . . . . . . . . . . . . . 10 2.1. Summary of IRI Syntax . . . . . . . . . . . . . . . . . . 9
2.2. ABNF for IRI References and IRIs . . . . . . . . . . . . 10 2.2. ABNF for IRI References and IRIs . . . . . . . . . . . . 10
3. Processing IRIs and related protocol elements . . . . . . . . 13 3. Processing IRIs and related protocol elements . . . . . . . . 12
3.1. Converting to UCS . . . . . . . . . . . . . . . . . . . . 14 3.1. Converting to UCS . . . . . . . . . . . . . . . . . . . . 13
3.2. Parse the IRI into IRI components . . . . . . . . . . . . 14 3.2. Parse the IRI into IRI components . . . . . . . . . . . . 13
3.3. General percent-encoding of IRI components . . . . . . . 15 3.3. General percent-encoding of IRI components . . . . . . . 14
3.4. Mapping ireg-name . . . . . . . . . . . . . . . . . . . . 15 3.4. Mapping ireg-name . . . . . . . . . . . . . . . . . . . . 14
3.4.1. Mapping using Percent-Encoding . . . . . . . . . . . . 15 3.4.1. Mapping using Percent-Encoding . . . . . . . . . . . . 14
3.4.2. Mapping using Punycode . . . . . . . . . . . . . . . . 16 3.4.2. Mapping using Punycode . . . . . . . . . . . . . . . . 14
3.4.3. Additional Considerations . . . . . . . . . . . . . . 16 3.4.3. Additional Considerations . . . . . . . . . . . . . . 15
3.5. Mapping query components . . . . . . . . . . . . . . . . 17 3.5. Mapping query components . . . . . . . . . . . . . . . . 16
3.6. Mapping IRIs to URIs . . . . . . . . . . . . . . . . . . 17 3.6. Mapping IRIs to URIs . . . . . . . . . . . . . . . . . . 16
3.7. Converting URIs to IRIs . . . . . . . . . . . . . . . . . 17 3.7. Converting URIs to IRIs . . . . . . . . . . . . . . . . . 16
3.7.1. Examples . . . . . . . . . . . . . . . . . . . . . . . 19 3.7.1. Examples . . . . . . . . . . . . . . . . . . . . . . . 18
4. Bidirectional IRIs for Right-to-Left Languages . . . . . . . . 20 4. Use of IRIs . . . . . . . . . . . . . . . . . . . . . . . . . 19
4.1. Logical Storage and Visual Presentation . . . . . . . . . 21 4.1. Limitations on UCS Characters Allowed in IRIs . . . . . . 19
4.2. Bidi IRI Structure . . . . . . . . . . . . . . . . . . . 22 4.2. Software Interfaces and Protocols . . . . . . . . . . . . 20
4.3. Input of Bidi IRIs . . . . . . . . . . . . . . . . . . . 23 4.3. Format of URIs and IRIs in Documents and Protocols . . . 20
4.4. Examples . . . . . . . . . . . . . . . . . . . . . . . . 23 4.4. Use of UTF-8 for Encoding Original Characters . . . . . . 20
5. Use of IRIs . . . . . . . . . . . . . . . . . . . . . . . . . 25 4.5. Relative IRI References . . . . . . . . . . . . . . . . . 22
5.1. Limitations on UCS Characters Allowed in IRIs . . . . . . 25 5. Liberal Handling of Otherwise Invalid IRIs . . . . . . . . . . 22
5.2. Software Interfaces and Protocols . . . . . . . . . . . . 26 5.1. LEIRI Processing . . . . . . . . . . . . . . . . . . . . 22
5.3. Format of URIs and IRIs in Documents and Protocols . . . 26 6. Characters Not Allowed in IRIs . . . . . . . . . . . . . . . . 23
5.4. Use of UTF-8 for Encoding Original Characters . . . . . . 26 7. URI/IRI Processing Guidelines (Informative) . . . . . . . . . 25
5.5. Relative IRI References . . . . . . . . . . . . . . . . . 28 7.1. URI/IRI Software Interfaces . . . . . . . . . . . . . . . 25
6. Liberal Handling of Otherwise Invalid IRIs . . . . . . . . . . 28 7.2. URI/IRI Entry . . . . . . . . . . . . . . . . . . . . . . 25
6.1. LEIRI Processing . . . . . . . . . . . . . . . . . . . . 29 7.3. URI/IRI Transfer between Applications . . . . . . . . . . 26
6.2. Web Address Processing . . . . . . . . . . . . . . . . . 29 7.4. URI/IRI Generation . . . . . . . . . . . . . . . . . . . 27
6.3. Characters Not Allowed in IRIs . . . . . . . . . . . . . 31 7.5. URI/IRI Selection . . . . . . . . . . . . . . . . . . . . 27
7. URI/IRI Processing Guidelines (Informative) . . . . . . . . . 33 7.6. Display of URIs/IRIs . . . . . . . . . . . . . . . . . . 28
7.1. URI/IRI Software Interfaces . . . . . . . . . . . . . . . 33 7.7. Interpretation of URIs and IRIs . . . . . . . . . . . . . 28
7.2. URI/IRI Entry . . . . . . . . . . . . . . . . . . . . . . 33 7.8. Upgrading Strategy . . . . . . . . . . . . . . . . . . . 29
7.3. URI/IRI Transfer between Applications . . . . . . . . . . 34 8. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 30
7.4. URI/IRI Generation . . . . . . . . . . . . . . . . . . . 34 9. Security Considerations . . . . . . . . . . . . . . . . . . . 30
7.5. URI/IRI Selection . . . . . . . . . . . . . . . . . . . . 35 10. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 31
7.6. Display of URIs/IRIs . . . . . . . . . . . . . . . . . . 36 11. Main Changes Since RFC 3987 . . . . . . . . . . . . . . . . . 32
7.7. Interpretation of URIs and IRIs . . . . . . . . . . . . . 36 11.1. Split out Bidi, processing guidelines, comparison
7.8. Upgrading Strategy . . . . . . . . . . . . . . . . . . . 37 sections . . . . . . . . . . . . . . . . . . . . . . . . 32
8. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 38 11.2. Major restructuring of IRI processing model . . . . . . . 32
9. Security Considerations . . . . . . . . . . . . . . . . . . . 38 11.2.1. OLD WAY . . . . . . . . . . . . . . . . . . . . . . . 32
10. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 39 11.2.2. NEW WAY . . . . . . . . . . . . . . . . . . . . . . . 32
11. Main Changes Since RFC 3987 . . . . . . . . . . . . . . . . . 40 11.2.3. Extension of Syntax . . . . . . . . . . . . . . . . . 33
11.1. Major restructuring of IRI processing model . . . . . . . 40 11.2.4. More to be added . . . . . . . . . . . . . . . . . . . 33
11.1.1. OLD WAY . . . . . . . . . . . . . . . . . . . . . . . 40 11.3. Change Log . . . . . . . . . . . . . . . . . . . . . . . 33
11.1.2. NEW WAY . . . . . . . . . . . . . . . . . . . . . . . 40 11.3.1. Changes after draft-ietf-iri-3987bis-01 . . . . . . . 33
11.1.3. Extension of Syntax . . . . . . . . . . . . . . . . . 41 11.3.2. Changes from draft-duerst-iri-bis-07 to
11.1.4. More to be added . . . . . . . . . . . . . . . . . . . 41 draft-ietf-iri-3987bis-00 . . . . . . . . . . . . . . 33
11.2. Change Log . . . . . . . . . . . . . . . . . . . . . . . 41 11.3.3. Changes from -06 to -07 of draft-duerst-iri-bis . . . 33
11.2.1. Changes after draft-ietf-iri-3987bis-01 . . . . . . . 41 11.4. Changes from -00 to -01 . . . . . . . . . . . . . . . . . 33
11.2.2. Changes from draft-duerst-iri-bis-07 to 11.5. Changes from -05 to -06 of draft-duerst-iri-bis-00 . . . 34
draft-ietf-iri-3987bis-00 . . . . . . . . . . . . . . 41 11.6. Changes from -04 to -05 of draft-duerst-iri-bis . . . . . 34
11.2.3. Changes from -06 to -07 of draft-duerst-iri-bis . . . 41 11.7. Changes from -03 to -04 of draft-duerst-iri-bis . . . . . 34
11.3. Changes from -00 to -01 . . . . . . . . . . . . . . . . . 41 11.8. Changes from -02 to -03 of draft-duerst-iri-bis . . . . . 34
11.4. Changes from -05 to -06 of draft-duerst-iri-bis-00 . . . 42 11.9. Changes from -01 to -02 of draft-duerst-iri-bis . . . . . 34
11.5. Changes from -04 to -05 of draft-duerst-iri-bis . . . . . 42 11.10. Changes from -00 to -01 of draft-duerst-iri-bis . . . . . 34
11.6. Changes from -03 to -04 of draft-duerst-iri-bis . . . . . 42 11.11. Changes from RFC 3987 to -00 of draft-duerst-iri-bis . . 35
11.7. Changes from -02 to -03 of draft-duerst-iri-bis . . . . . 42 12. References . . . . . . . . . . . . . . . . . . . . . . . . . . 35
11.8. Changes from -01 to -02 of draft-duerst-iri-bis . . . . . 42 12.1. Normative References . . . . . . . . . . . . . . . . . . 35
11.9. Changes from -00 to -01 of draft-duerst-iri-bis . . . . . 42 12.2. Informative References . . . . . . . . . . . . . . . . . 36
11.10. Changes from RFC 3987 to -00 of draft-duerst-iri-bis . . 43 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 38
12. References . . . . . . . . . . . . . . . . . . . . . . . . . . 43
12.1. Normative References . . . . . . . . . . . . . . . . . . 43
12.2. Informative References . . . . . . . . . . . . . . . . . 44
Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 46
1. Introduction 1. Introduction
1.1. Overview and Motivation 1.1. Overview and Motivation
A Uniform Resource Identifier (URI) is defined in [RFC3986] as a A Uniform Resource Identifier (URI) is defined in [RFC3986] as a
sequence of characters chosen from a limited subset of the repertoire sequence of characters chosen from a limited subset of the repertoire
of US-ASCII [ASCII] characters. of US-ASCII [ASCII] characters.
The characters in URIs are frequently used for representing words of The characters in URIs are frequently used for representing words of
skipping to change at page 5, line 46 skipping to change at page 5, line 46
[RFC3986]. [RFC3986].
This document defines the protocol element called Internationalized This document defines the protocol element called Internationalized
Resource Identifier (IRI), which allows applications of URIs to be Resource Identifier (IRI), which allows applications of URIs to be
extended to use resource identifiers that have a much wider extended to use resource identifiers that have a much wider
repertoire of characters. It also provides corresponding repertoire of characters. It also provides corresponding
"internationalized" versions of other constructs from [RFC3986], such "internationalized" versions of other constructs from [RFC3986], such
as URI references. The syntax of IRIs is defined in Section 2. as URI references. The syntax of IRIs is defined in Section 2.
Using characters outside of A - Z in IRIs adds a number of Using characters outside of A - Z in IRIs adds a number of
difficulties. Section 4 discusses the special case of bidirectional difficulties. Section 4 discusses the use of IRIs in different
IRIs using characters from scripts written right-to-left. Section 5 situations. Section 7 gives additional informative guidelines.
discusses the use of IRIs in different situations. Section 7 gives Section 9 discusses IRI-specific security considerations.
additional informative guidelines. Section 9 discusses IRI-specific
security considerations. [Bidi] discusses the special case of bidirectional IRIs using
characters from scripts written right-to-left. [Equivalence] gives
guidelines for applications wishing to determine if two IRIs are
equivalent, as well as defining some equivalence methods.
[RFC4395bis] updates the URI scheme registration guidelines and
procedures to note that every URI scheme is also automatically an
IRI scheme and to allow scheme definitions to be directly described
in terms of Unicode characters.
When originally defining IRIs, several design alternatives were When originally defining IRIs, several design alternatives were
considered. Historically interested readers can find an overview in considered. Historically interested readers can find an overview in
Appendix A of [RFC3987]. For some additional background on the Appendix A of [RFC3987]. For some additional background on the
design of URIs and IRIs, please also see [Gettys]. design of URIs and IRIs, please also see [Gettys].
1.2. Applicability 1.2. Applicability
IRIs are designed to allow protocols and software that deal with URIs IRIs are designed to allow protocols and software that deal with URIs
to be updated to handle IRIs. A "URI scheme" (as defined by to be updated to handle IRIs. Processing of IRIs is accomplished by
[RFC3986] and registered through the IANA process defined in extending the URI syntax while retaining (and not expanding) the set
[RFC4395bis]) also serves as an "IRI scheme". Processing of IRIs is of "reserved" characters, such that the syntax for any URI scheme may
accomplished by extending the URI syntax while retaining (and not be extended to allow non-ASCII characters. In addition, following
expanding) the set of "reserved" characters, such that the syntax for parsing of an IRI, it is possible to construct a corresponding URI by
any URI scheme may be extended to allow non-ASCII characters. In first encoding characters outside of the allowed URI range and then
addition, following parsing of an IRI, it is possible to construct a reassembling the components.
corresponding URI by first encoding characters outside of the allowed
URI range and then reassembling the components.
Practical use of IRI forms in place of URI forms depends on the Practical use of IRI forms in place of URI forms depends on the
following conditions being met: following conditions being met:
a. A protocol or format element MUST be explicitly designated to be a. A protocol or format element MUST be explicitly designated to be
able to carry IRIs. The intent is to avoid introducing IRIs into able to carry IRIs. The intent is to avoid introducing IRIs into
contexts that are not defined to accept them. For example, XML contexts that are not defined to accept them. For example, XML
schema [XMLSchema] has an explicit type "anyURI" that includes schema [XMLSchema] has an explicit type "anyURI" that includes
IRIs and IRI references. Therefore, IRIs and IRI references can IRIs and IRI references. Therefore, IRIs and IRI references can
be in attributes and elements of type "anyURI". On the other be in attributes and elements of type "anyURI". On the other
skipping to change at page 6, line 49 skipping to change at page 7, line 6
c. The URI scheme definition, if it explicitly allows a percent sign c. The URI scheme definition, if it explicitly allows a percent sign
("%") in any syntactic component, SHOULD define the interpretation ("%") in any syntactic component, SHOULD define the interpretation
of sequences of percent-encoded octets (using "%XX" hex octets) as of sequences of percent-encoded octets (using "%XX" hex octets) as
octets from sequences of UTF-8 encoded strings; this is recommended octets from sequences of UTF-8 encoded strings; this is recommended
in the guidelines for registering new schemes, [RFC4395bis]. For in the guidelines for registering new schemes, [RFC4395bis]. For
example, this is the practice for IMAP URLs [RFC2192], POP URLs example, this is the practice for IMAP URLs [RFC2192], POP URLs
[RFC2384] and the URN syntax [RFC2141]. Note that use of [RFC2384] and the URN syntax [RFC2141]. Note that use of
percent-encoding may also be restricted in some situations, for percent-encoding may also be restricted in some situations, for
example, URI schemes that disallow percent-encoding might still be example, URI schemes that disallow percent-encoding might still be
used with a fragment identifier which is percent-encoded (e.g., used with a fragment identifier which is percent-encoded (e.g.,
[XPointer]). See Section 5.4 for further discussion. [XPointer]). See Section 4.4 for further discussion.
1.3. Definitions 1.3. Definitions
The following definitions are used in this document; they follow the The following definitions are used in this document; they follow the
terms in [RFC2130], [RFC2277], and [ISO10646]. terms in [RFC2130], [RFC2277], and [ISO10646].
character: A member of a set of elements used for the organization, character: A member of a set of elements used for the organization,
control, or representation of data. For example, "LATIN CAPITAL control, or representation of data. For example, "LATIN CAPITAL
LETTER A" names a character. LETTER A" names a character.
skipping to change at page 7, line 43 skipping to change at page 7, line 46
ISO/IEC 10646 [ISO10646] and the Unicode Standard [UNIV6]. ISO/IEC 10646 [ISO10646] and the Unicode Standard [UNIV6].
IRI reference: Denotes the common usage of an Internationalized IRI reference: Denotes the common usage of an Internationalized
Resource Identifier. An IRI reference may be absolute or Resource Identifier. An IRI reference may be absolute or
relative. However, the "IRI" that results from such a reference relative. However, the "IRI" that results from such a reference
only includes absolute IRIs; any relative IRI references are only includes absolute IRIs; any relative IRI references are
resolved to their absolute form. Note that in [RFC2396] URIs did resolved to their absolute form. Note that in [RFC2396] URIs did
not include fragment identifiers, but in [RFC3986] fragment not include fragment identifiers, but in [RFC3986] fragment
identifiers are part of URIs. identifiers are part of URIs.
URL: The term "URL" was originally used [RFC1738] for roughly what
is now called a "URI". Books, software and documentation often
refer to URIs and IRIs using the "URL" term. Some usages
restrict "URL" to those URIs which are not URNs. Because of the
ambiguity of the term, using the term "URL" is NOT RECOMMENDED in
formal documents.
LEIRI (Legacy Extended IRI) processing: This term was used in LEIRI (Legacy Extended IRI) processing: This term was used in
various XML specifications to refer to strings that, although not various XML specifications to refer to strings that, although not
valid IRIs, were acceptable input to the processing rules in valid IRIs, were acceptable input to the processing rules in
Section 6.1. Section 5.1.
(Web Address, Hypertext Reference, HREF): These terms have been
added in this document for convenience, to allow other
specifications to refer to those strings that, although not valid
IRIs, are acceptable input to the processing rules in Section 6.2.
This usage corresponds to the parsing rules of some popular web
browsing applications. ISSUE: Need to find a good name/
abbreviation for these.
running text: Human text (paragraphs, sentences, phrases) with running text: Human text (paragraphs, sentences, phrases) with
syntax according to orthographic conventions of a natural syntax according to orthographic conventions of a natural
language, as opposed to syntax defined for ease of processing by language, as opposed to syntax defined for ease of processing by
machines (e.g., markup, programming languages). machines (e.g., markup, programming languages).
protocol element: Any portion of a message that affects processing protocol element: Any portion of a message that affects processing
of that message by the protocol in question. of that message by the protocol in question.
presentation element: A presentation form corresponding to a presentation element: A presentation form corresponding to a
skipping to change at page 9, line 19 skipping to change at page 8, line 52
1.4. Notation 1.4. Notation
RFCs and Internet Drafts currently do not allow any characters RFCs and Internet Drafts currently do not allow any characters
outside the US-ASCII repertoire. Therefore, this document uses outside the US-ASCII repertoire. Therefore, this document uses
various special notations to denote such characters in examples. various special notations to denote such characters in examples.
In text, characters outside US-ASCII are sometimes referenced by In text, characters outside US-ASCII are sometimes referenced by
using a prefix of 'U+', followed by four to six hexadecimal digits. using a prefix of 'U+', followed by four to six hexadecimal digits.
To represent characters outside US-ASCII in examples, this document To represent characters outside US-ASCII in examples, this document
uses two notations: 'XML Notation' and 'Bidi Notation'. uses 'XML Notation'.
XML Notation uses a leading '&#x', a trailing ';', and the XML Notation uses a leading '&#x', a trailing ';', and the
hexadecimal number of the character in the UCS in between. For hexadecimal number of the character in the UCS in between. For
example, &#x44F; stands for CYRILLIC CAPITAL LETTER YA. In this example, &#x44F; stands for CYRILLIC CAPITAL LETTER YA. In this
notation, an actual '&' is denoted by '&amp;'. notation, an actual '&' is denoted by '&amp;'.
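The XML Notation described above can be turned back into actual characters mechanically. The following is a minimal Python sketch, not part of either draft; the function name from_xml_notation is a hypothetical helper chosen for illustration, and it assumes that only hexadecimal character references and '&amp;' occur in the input.

```python
import re

# Matches either a hexadecimal character reference (&#x...;) or &amp;
_XML_NOTATION = re.compile(r"&(?:#x([0-9A-Fa-f]+)|amp);")


def from_xml_notation(text: str) -> str:
    """Decode the XML Notation used in the examples (hypothetical helper)."""

    def decode(match: "re.Match[str]") -> str:
        hex_digits = match.group(1)
        if hex_digits is None:
            # The match was '&amp;', which denotes an actual '&'.
            return "&"
        # Otherwise the digits are the UCS code point in hexadecimal.
        return chr(int(hex_digits, 16))

    # A single pass ensures that '&amp;#x44F;' yields the literal text
    # '&#x44F;' rather than being decoded twice.
    return _XML_NOTATION.sub(decode, text)
```

Decoding in a single pass is a deliberate choice here: decoding '&amp;' first and character references second would misread an escaped ampersand that happens to be followed by '#x'.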
Bidi Notation is used for bidirectional examples: Lower case letters
stand for Latin letters or other letters that are written left to
right, whereas upper case letters represent Arabic or Hebrew letters
that are written right to left.
To denote actual octets in examples (as opposed to percent-encoded To denote actual octets in examples (as opposed to percent-encoded
octets), the two hex digits denoting the octet are enclosed in "<" octets), the two hex digits denoting the octet are enclosed in "<"
and ">". For example, the octet often denoted as 0xc9 is denoted and ">". For example, the octet often denoted as 0xc9 is denoted
here as <c9>. here as <c9>.
In this document, the key words "MUST", "MUST NOT", "REQUIRED", In this document, the key words "MUST", "MUST NOT", "REQUIRED",
"SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY",
and "OPTIONAL" are to be interpreted as described in [RFC2119]. and "OPTIONAL" are to be interpreted as described in [RFC2119].
2. IRI Syntax 2. IRI Syntax
skipping to change at page 10, line 14 skipping to change at page 9, line 41
the containing protocol or document ensures that the characters in the containing protocol or document ensures that the characters in
the IRI can be handled (e.g., searched, converted, displayed) in the the IRI can be handled (e.g., searched, converted, displayed) in the
same way as the rest of the protocol or document. same way as the rest of the protocol or document.
2.1. Summary of IRI Syntax 2.1. Summary of IRI Syntax
The IRI syntax extends the URI syntax in [RFC3986] by extending the The IRI syntax extends the URI syntax in [RFC3986] by extending the
class of unreserved characters, primarily by adding the characters of class of unreserved characters, primarily by adding the characters of
the UCS (Universal Character Set, [ISO10646]) beyond U+007F, subject the UCS (Universal Character Set, [ISO10646]) beyond U+007F, subject
to the limitations given in the syntax rules below and in to the limitations given in the syntax rules below and in
Section 5.1. Section 4.1.
The syntax and use of components and reserved characters is the same The syntax and use of components and reserved characters is the same
as that in [RFC3986]. Each "URI scheme" thus also functions as an as that in [RFC3986]. Each "URI scheme" thus also functions as an
"IRI scheme", in that scheme-specific parsing rules for URIs of a "IRI scheme", in that scheme-specific parsing rules for URIs of a
scheme are extended to allow parsing of IRIs using the same scheme are extended to allow parsing of IRIs using the same
parsing rules. parsing rules.
All the operations defined in [RFC3986], such as the resolution of All the operations defined in [RFC3986], such as the resolution of
relative references, can be applied to IRIs by IRI-processing relative references, can be applied to IRIs by IRI-processing
software in exactly the same way as they are for URIs by URI- software in exactly the same way as they are for URIs by URI-
skipping to change at page 14, line 46 skipping to change at page 14, line 5
represented independent of any character encoding) represent the IRI represented independent of any character encoding) represent the IRI
as a sequence of characters from the UCS normalized according to as a sequence of characters from the UCS normalized according to
Unicode Normalization Form C (NFC, [UTR15]). Unicode Normalization Form C (NFC, [UTR15]).
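As an illustration of the normalization step only, the sketch below applies NFC with Python's standard unicodedata module. It is an assumption-laden example rather than the drafts' own procedure: the function name normalize_transcribed is hypothetical, and the conditions under which normalization applies are those stated in the surrounding text, which is only partly visible in this excerpt.

```python
import unicodedata


def normalize_transcribed(text: str) -> str:
    """Return the NFC form of a transcribed IRI (hypothetical helper)."""
    # NFC composes base letters with following combining marks where a
    # precomposed character exists, e.g. 'e' + U+0301 becomes U+00E9.
    return unicodedata.normalize("NFC", text)
```

For example, an input spelled with a combining acute accent ("rose" followed by U+0301) and an input using the precomposed U+00E9 both come out as the same character sequence.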
3.2. Parse the IRI into IRI components 3.2. Parse the IRI into IRI components
Parse the IRI, either as a relative reference (no scheme) or using Parse the IRI, either as a relative reference (no scheme) or using
scheme specific processing (according to the scheme given); the scheme specific processing (according to the scheme given); the
result is a set of parsed IRI components. result is a set of parsed IRI components.
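A generic (scheme-independent) split into components can be sketched with the regular expression given in Appendix B of [RFC3986], which only looks at the delimiters ":", "/", "?" and "#" and therefore works unchanged on non-ASCII input. This is a minimal Python sketch; the function name split_iri and the dictionary keys are hypothetical, and scheme-specific parsing is not attempted.

```python
import re
from typing import Dict, Optional

# Component-splitting expression adapted from RFC 3986, Appendix B,
# rewritten with non-capturing groups so only the five components are captured.
_COMPONENTS = re.compile(
    r"^(?:([^:/?#]+):)?(?://([^/?#]*))?([^?#]*)(?:\?([^#]*))?(?:#(.*))?$",
    re.DOTALL,
)


def split_iri(iri: str) -> Dict[str, Optional[str]]:
    """Split an IRI reference into its five generic components."""
    match = _COMPONENTS.match(iri)
    # Every part of the expression is optional, so it matches any string.
    assert match is not None
    scheme, authority, path, query, fragment = match.groups()
    return {
        "scheme": scheme,        # None for a relative reference
        "authority": authority,  # None when there is no '//' part
        "path": path,            # always a string, possibly empty
        "query": query,
        "fragment": fragment,
    }
```

A component that is absent is reported as None, which keeps it distinguishable from a component that is present but empty (for instance a trailing "?" with nothing after it).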
NOTE: The result of parsing into components will correspond to
substrings of the IRI that may be accessible via an API. For example,
in [HTML5], the protocol components of interest are SCHEME (scheme),
HOST (ireg-name), PORT (port), the PATH (ipath after the initial
"/"), QUERY (iquery), FRAGMENT (ifragment), and AUTHORITY
(iauthority).
Subsequent processing rules are sometimes used to define other
syntactic components. For example, [HTML5] defines APIs for IRI
processing; in these APIs:
HOSTSPECIFIC the substring that follows the substring matched by the
iauthority production, or the whole string if the iauthority
production wasn't matched.
HOSTPORT if there is a scheme component and a port component and the
port given by the port component is different than the default
port defined for the protocol given by the scheme component, then
HOSTPORT is the substring that starts with the substring matched
by the host production and ends with the substring matched by the
port production, and includes the colon in between the two.
Otherwise, it is the same as the host component.
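The HOSTPORT rule above can be sketched as follows. This is a minimal illustration, not the [HTML5] API: the function name hostport and the DEFAULT_PORTS table are assumptions made for this sketch, and Python's urlsplit lowercases the host, so the result is not always a literal substring of the input.

```python
from urllib.parse import urlsplit

# Assumption: a small illustrative table of default ports, not taken from the draft.
DEFAULT_PORTS = {"http": 80, "https": 443, "ftp": 21}


def hostport(iri: str) -> str:
    """Return host plus ":port" only when the port differs from the scheme default."""
    parts = urlsplit(iri)
    host = parts.hostname or ""
    port = parts.port
    # A scheme and an explicit, non-default port are both required to include the port.
    if parts.scheme and port is not None and port != DEFAULT_PORTS.get(parts.scheme):
        return f"{host}:{port}"
    return host
```

With this sketch, an explicit ":80" on an "http" IRI is dropped, while ":8080" is kept.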
3.3. General percent-encoding of IRI components 3.3. General percent-encoding of IRI components
Except as noted in the following subsections, IRI components are Except as noted in the following subsections, IRI components are
mapped to the equivalent URI components by percent-encoding those mapped to the equivalent URI components by percent-encoding those
characters not allowed in URIs. Previous processing steps will have characters not allowed in URIs. Previous processing steps will have
removed some characters, and the interpretation of reserved removed some characters, and the interpretation of reserved
characters will have already been done (with the syntactic reserved characters will have already been done (with the syntactic reserved
characters outside of the IRI component). This mapping is defined characters outside of the IRI component). This mapping is defined
for all sequences of Unicode characters, whether or not they are for all sequences of Unicode characters, whether or not they are
valid for the component in question. valid for the component in question.
skipping to change at page 17, line 20 skipping to change at page 16, line 7
when arbitrary content is included in some part of a URI.) For when arbitrary content is included in some part of a URI.) For
example, an IRI of example, an IRI of
"http://www.example.org/red%09ros&#xE9;#red" (in XML notation) is "http://www.example.org/red%09ros&#xE9;#red" (in XML notation) is
converted to converted to
"http://www.example.org/red%09ros%C3%A9#red", not to something "http://www.example.org/red%09ros%C3%A9#red", not to something
like like
"http%3A%2F%2Fwww.example.org%2Fred%2509ros%C3%A9%23red". "http%3A%2F%2Fwww.example.org%2Fred%2509ros%C3%A9%23red".
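The conversion in the example above can be sketched as follows. This is a simplified illustration that assumes only non-ASCII characters need encoding and that existing ASCII characters, including "%", are passed through unchanged; the function name pct_encode_component is hypothetical.

```python
def pct_encode_component(component: str) -> str:
    """Percent-encode the UTF-8 octets of every non-ASCII character."""
    out = []
    for ch in component:
        if ord(ch) < 0x80:
            # ASCII is copied as-is, so an existing "%09" is not double-encoded.
            out.append(ch)
        else:
            out.extend(f"%{octet:02X}" for octet in ch.encode("utf-8"))
    return "".join(out)
```

Applied to the path component "/red%09ros" followed by U+00E9, this yields "/red%09ros%C3%A9"; the delimiters between components are never passed through the function, which is why "/" and "#" stay unencoded.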
3.5. Mapping query components 3.5. Mapping query components
((NOTE: SEE ISSUES LIST)) For compatibility with existing deployed For compatibility with existing deployed HTTP infrastructure, the
HTTP infrastructure, the following special case applies for schemes following special case applies for schemes "http" and "https" and
"http" and "https" and IRIs whose origin has a document charset other IRIs whose origin has a document charset other than one which is UCS-
than one which is UCS-based (e.g., UTF-8 or UTF-16). In such a case, based (e.g., UTF-8 or UTF-16). In such a case, the "query" component
the "query" component of an IRI is mapped into a URI by using the of an IRI is mapped into a URI by using the document charset rather
document charset rather than UTF-8 as the binary representation than UTF-8 as the binary representation before pct-encoding. This
before pct-encoding. This mapping is not applied for any other mapping is not applied for any other scheme or component.
scheme or component.
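A minimal sketch of this special case follows. The function name map_query and the UCS_BASED set are assumptions made for illustration; how the document charset is discovered is outside the sketch, and a character that the document charset cannot represent raises UnicodeEncodeError here rather than being handled.

```python
import codecs

# Assumption: charsets treated as "UCS-based" for this sketch.
UCS_BASED = {"utf-8", "utf-16", "utf-16-le", "utf-16-be",
             "utf-32", "utf-32-le", "utf-32-be"}


def map_query(query: str, scheme: str, document_charset: str = "utf-8") -> str:
    """Percent-encode a query, using the document charset only for http/https."""
    name = codecs.lookup(document_charset).name  # normalized codec name
    use_document_charset = scheme.lower() in ("http", "https") and name not in UCS_BASED
    encoding = name if use_document_charset else "utf-8"
    out = []
    for ch in query:
        if ord(ch) < 0x80:
            out.append(ch)
        else:
            out.extend(f"%{octet:02X}" for octet in ch.encode(encoding))
    return "".join(out)
```

For a query containing U+00E9, an "http" IRI from an iso-8859-1 document produces "%E9", whereas any other scheme, or a UTF-8 document, produces "%C3%A9".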
3.6. Mapping IRIs to URIs 3.6. Mapping IRIs to URIs
The canonical mapping from a IRI to URI is defined by applying the The mapping from an IRI to URI is accomplished by applying the
mapping above (from IRI to URI components) and then reassembling a mapping above (from IRI to URI components) and then reassembling a
URI from the parsed URI components using the original punctuation URI from the parsed URI components using the original punctuation
that delimited the IRI components. that delimited the IRI components.
3.7. Converting URIs to IRIs 3.7. Converting URIs to IRIs
In some situations, for presentation and further processing, it is In some situations, for presentation and further processing, it is
desirable to convert a URI into an equivalent IRI in which natural desirable to convert a URI into an equivalent IRI in which natural
characters are represented directly rather than percent encoded. Of characters are represented directly rather than percent encoded. Of
course, every URI is already an IRI in its own right without any course, every URI is already an IRI in its own right without any
skipping to change at page 18, line 21 skipping to change at page 17, line 6
2. Some percent-encodings cannot be interpreted as sequences of UTF-8 2. Some percent-encodings cannot be interpreted as sequences of UTF-8
octets. octets.
(Note: The octet patterns of UTF-8 are highly regular. Therefore, (Note: The octet patterns of UTF-8 are highly regular. Therefore,
there is a very high probability, but no guarantee, that percent- there is a very high probability, but no guarantee, that percent-
encodings that can be interpreted as sequences of UTF-8 octets encodings that can be interpreted as sequences of UTF-8 octets
actually originated from UTF-8. For a detailed discussion, see actually originated from UTF-8. For a detailed discussion, see
[Duerst97].) [Duerst97].)
3. The conversion may result in a character that is not appropriate 3. The conversion may result in a character that is not appropriate
in an IRI. See Section 2.2, Section 4.1, and Section 5.1 for in an IRI. See Section 2.2, and Section 4.1 for further details.
further details.
4. IRI to URI conversion has different rules for dealing with domain 4. IRI to URI conversion has different rules for dealing with domain
names and query parameters. names and query parameters.
Conversion from a URI to an IRI MAY be done by using the following Conversion from a URI to an IRI MAY be done by using the following
steps: steps:
1. Represent the URI as a sequence of octets in US-ASCII. 1. Represent the URI as a sequence of octets in US-ASCII.
2. Convert all percent-encodings ("%" followed by two hexadecimal 2. Convert all percent-encodings ("%" followed by two hexadecimal
digits) to the corresponding octets, except those corresponding to digits) to the corresponding octets, except those corresponding to
"%", characters in "reserved", and characters in US-ASCII not "%", characters in "reserved", and characters in US-ASCII not
allowed in URIs. allowed in URIs.
3. Re-percent-encode any octet produced in step 2 that is not part of 3. Re-percent-encode any octet produced in step 2 that is not part of
a strictly legal UTF-8 octet sequence. a strictly legal UTF-8 octet sequence.
4. Re-percent-encode all octets produced in step 3 that in UTF-8 4. Re-percent-encode all octets produced in step 3 that in UTF-8
represent characters that are not appropriate according to represent characters that are not appropriate according to
Section 2.2, Section 4.1, and Section 5.1. Section 2.2 and Section 4.1.
5. Interpret the resulting octet sequence as a sequence of characters 5. Interpret the resulting octet sequence as a sequence of characters
encoded in UTF-8. encoded in UTF-8.
6. URIs known to contain domain names in the reg-name component 6. URIs known to contain domain names in the reg-name component
SHOULD convert punycode-encoded domain name labels to the SHOULD convert punycode-encoded domain name labels to the
corresponding characters using the ToUnicode procedure. corresponding characters using the ToUnicode procedure.
This procedure will convert as many percent-encoded characters as This procedure will convert as many percent-encoded characters as
possible to characters in an IRI. Because there are some choices possible to characters in an IRI. Because there are some choices
when step 4 is applied (see Section 5.1), results may vary. when step 4 is applied (see Section 4.1), results may vary.
Conversions from URIs to IRIs MUST NOT use any character encoding Conversions from URIs to IRIs MUST NOT use any character encoding
other than UTF-8 in steps 3 and 4, even if it might be possible to other than UTF-8 in steps 3 and 4, even if it might be possible to
guess from the context that another character encoding than UTF-8 was guess from the context that another character encoding than UTF-8 was
used in the URI. For example, the URI used in the URI. For example, the URI
"http://www.example.org/r%E9sum%E9.html" might with some guessing be "http://www.example.org/r%E9sum%E9.html" might with some guessing be
interpreted to contain two e-acute characters encoded as iso-8859-1. interpreted to contain two e-acute characters encoded as iso-8859-1.
It must not be converted to an IRI containing these e-acute It must not be converted to an IRI containing these e-acute
characters. Otherwise, in the future the IRI will be mapped to characters. Otherwise, in the future the IRI will be mapped to
"http://www.example.org/r%C3%A9sum%C3%A9.html", which is a different "http://www.example.org/r%C3%A9sum%C3%A9.html", which is a different
skipping to change at page 20, line 19 skipping to change at page 18, line 51
3. http://www.example.org/D%FCrst 3. http://www.example.org/D%FCrst
4. http://www.example.org/D%FCrst 4. http://www.example.org/D%FCrst
5. http://www.example.org/D%FCrst 5. http://www.example.org/D%FCrst
6. http://www.example.org/D%FCrst 6. http://www.example.org/D%FCrst
The following example contains "%e2%80%ae", which is the percent- The following example contains "%e2%80%ae", which is the percent-
encoded encoded
UTF-8 character encoding of U+202E, RIGHT-TO-LEFT OVERRIDE. UTF-8 character encoding of U+202E, RIGHT-TO-LEFT OVERRIDE. The
Section 4.1 forbids the direct use of this character in an IRI. direct use of this character is forbidden in an IRI. Therefore, the
Therefore, the corresponding octets are re-percent-encoded in step 4. corresponding octets are re-percent-encoded in step 4. This example
This example shows that the case (upper- or lowercase) of letters shows that the case (upper- or lowercase) of letters used in percent-
used in percent-encodings may not be preserved. The example also encodings may not be preserved. The example also contains a
contains a punycode-encoded domain name label (xn--99zt52a), which is punycode-encoded domain name label (xn--99zt52a), which is not
not converted. converted.
1. http://xn--99zt52a.example.org/%e2%80%ae 1. http://xn--99zt52a.example.org/%e2%80%ae
2. http://xn--99zt52a.example.org/<e2><80><ae> 2. http://xn--99zt52a.example.org/<e2><80><ae>
3. http://xn--99zt52a.example.org/<e2><80><ae> 3. http://xn--99zt52a.example.org/<e2><80><ae>
4. http://xn--99zt52a.example.org/%E2%80%AE 4. http://xn--99zt52a.example.org/%E2%80%AE
5. http://xn--99zt52a.example.org/%E2%80%AE 5. http://xn--99zt52a.example.org/%E2%80%AE
6. http://&#x7D0D;&#x8C46;.example.org/%E2%80%AE 6. http://&#x7D0D;&#x8C46;.example.org/%E2%80%AE
Note that the label "xn--99zt52a" is converted to U+7D0D U+8C46 Note that the label "xn--99zt52a" is converted to U+7D0D U+8C46
(Japanese Natto). ((EDITOR NOTE: There is some inconsistency in this (Japanese Natto). ((EDITOR NOTE: There is some inconsistency in this
note.)) note.))
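Steps 1 through 5 of the conversion procedure in this section can be sketched as follows. This is an illustration under stated assumptions, not a reference implementation: the names uri_to_iri and NOT_APPROPRIATE are hypothetical, NOT_APPROPRIATE lists only the bidirectional formatting characters as a stand-in for the full set of characters excluded in step 4, and step 6 (ToUnicode on domain name labels) is left out.

```python
import re

UNRESERVED = set("ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~")
# Assumption: illustrative subset of characters "not appropriate" in an IRI (step 4).
NOT_APPROPRIATE = {0x200E, 0x200F, 0x202A, 0x202B, 0x202C, 0x202D, 0x202E}


def _utf8_length(lead: int) -> int:
    """Length of the UTF-8 sequence announced by a lead octet, or 0 if not a lead."""
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    return 0


def _pct(octets) -> str:
    return "".join(f"%{octet:02X}" for octet in octets)


def _convert_run(match) -> str:
    """Convert one run of consecutive percent-encodings."""
    octets = bytes.fromhex(match.group(0).replace("%", ""))
    out, i = [], 0
    while i < len(octets):
        b = octets[i]
        if b < 0x80:
            # Step 2: "%", reserved, and disallowed ASCII stay percent-encoded.
            out.append(chr(b) if chr(b) in UNRESERVED else _pct([b]))
            i += 1
            continue
        n = _utf8_length(b)
        ch = None
        if n:
            try:
                ch = octets[i:i + n].decode("utf-8")  # strict UTF-8 only
            except UnicodeDecodeError:
                ch = None
        if ch is None:
            out.append(_pct([b]))                # step 3: not legal UTF-8
            i += 1
        elif ord(ch) in NOT_APPROPRIATE:
            out.append(_pct(octets[i:i + n]))    # step 4: keep encoded
            i += n
        else:
            out.append(ch)                       # step 5: interpret as UTF-8
            i += n
    return "".join(out)


def uri_to_iri(uri: str) -> str:
    uri.encode("ascii")  # step 1: raises if the input is not US-ASCII
    return re.sub(r"(?:%[0-9A-Fa-f]{2})+", _convert_run, uri)
```

The sketch leaves "D%FCrst" unchanged because 0xFC is not legal UTF-8, and it turns "%e2%80%ae" into "%E2%80%AE", which also shows why the case of hexadecimal digits may not be preserved.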
4. Bidirectional IRIs for Right-to-Left Languages 4. Use of IRIs
Some UCS characters, such as those used in the Arabic and Hebrew
scripts, have an inherent right-to-left (rtl) writing direction.
IRIs containing these characters (called bidirectional IRIs or Bidi
IRIs) require additional attention because of the non-trivial
relation between logical representation (used for digital
representation and for reading/spelling) and visual representation
(used for display/printing).
Because of the complex interaction between the logical
representation, the visual representation, and the syntax of a Bidi
IRI, a balance is needed between various requirements. The main
requirements are
1. user-predictable conversion between visual and logical
representation;
2. the ability to include a wide range of characters in various parts
of the IRI; and
3. minor or no changes or restrictions for implementations.
4.1. Logical Storage and Visual Presentation
When stored or transmitted in digital representation, bidirectional
IRIs MUST be in full logical order and MUST conform to the IRI syntax
rules (which includes the rules relevant to their scheme). This
ensures that bidirectional IRIs can be processed in the same way as
other IRIs.
Bidirectional IRIs MUST be rendered by using the Unicode
Bidirectional Algorithm [UNIV6], [UNI9]. Bidirectional IRIs MUST be
rendered in the same way as they would be if they were in a left-to-
right embedding; i.e., as if they were preceded by U+202A, LEFT-TO-
RIGHT EMBEDDING (LRE), and followed by U+202C, POP DIRECTIONAL
FORMATTING (PDF). Setting the embedding direction can also be done
in a higher-level protocol (e.g., the dir='ltr' attribute in HTML).
There is no requirement to use the above embedding if the display is
still the same without the embedding. For example, a bidirectional
IRI in a text with left-to-right base directionality (such as used
for English or Cyrillic) that is preceded and followed by whitespace
and strong left-to-right characters does not need an embedding.
Also, a bidirectional relative IRI reference that only contains
strong right-to-left characters and weak characters and that starts
and ends with a strong right-to-left character and appears in a text
with right-to-left base directionality (such as used for Arabic or
Hebrew) and is preceded and followed by whitespace and strong
characters does not need an embedding.
In some other cases, using U+200E, LEFT-TO-RIGHT MARK (LRM), may be
sufficient to force the correct display behavior. However, the
details of the Unicode Bidirectional algorithm are not always easy to
understand. Implementers are strongly advised to err on the side of
caution and to use embedding in all cases where they are not
completely sure that the display behavior is unaffected without the
embedding.
The Unicode Bidirectional Algorithm ([UNI9], section 4.3) permits
higher-level protocols to influence bidirectional rendering. Such
changes by higher-level protocols MUST NOT be used if they change the
rendering of IRIs.
The bidirectional formatting characters that may be used before or
after the IRI to ensure correct display are not themselves part of
the IRI. IRIs MUST NOT contain bidirectional formatting characters
(LRM, RLM, LRE, RLE, LRO, RLO, and PDF). They affect the visual
rendering of the IRI but do not appear themselves. It would
therefore not be possible to input an IRI with such characters
correctly.
4.2. Bidi IRI Structure
The Unicode Bidirectional Algorithm is designed mainly for running
text. To make sure that it does not affect the rendering of
bidirectional IRIs too much, some restrictions on bidirectional IRIs
are necessary. These restrictions are given in terms of delimiters
(structural characters, mostly punctuation such as "@", ".", ":", and
"/") and components (usually consisting mostly of letters and
digits).
The following syntax rules from Section 2.2 correspond to components
for the purpose of Bidi behavior: iuserinfo, ireg-name, isegment,
isegment-nz, isegment-nz-nc, iquery, and ifragment.
Specifications that define the syntax of any of the above components
MAY divide them further and define smaller parts to be components
according to this document. As an example, the restrictions of
[RFC3490] on bidirectional domain names correspond to treating each
label of a domain name as a component for schemes with ireg-name as a
domain name. Even where the components are not defined formally, it
may be helpful to think about some syntax in terms of components and
to apply the relevant restrictions. For example, for the usual name/
value syntax in query parts, it is convenient to treat each name and
each value as a component. As another example, the extensions in a
resource name can be treated as separate components.
For each component, the following restrictions apply:
1. A component SHOULD NOT use both right-to-left and left-to-right
characters.
2. A component using right-to-left characters SHOULD start and end
with right-to-left characters.
The above restrictions are given as "SHOULD"s, rather than as
"MUST"s. For IRIs that are never presented visually, they are not
relevant. However, for IRIs in general, they are very important to
ensure consistent conversion between visual presentation and logical
representation, in both directions.
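The two per-component restrictions can be checked mechanically. The sketch below is one possible reading, assuming that "right-to-left characters" means Unicode bidi classes R and AL and "left-to-right characters" means class L; the function name check_bidi_component is hypothetical.

```python
import unicodedata

RTL_CLASSES = {"R", "AL"}


def check_bidi_component(component: str) -> list:
    """Return a list of violated restrictions (empty if the component is fine)."""
    classes = [unicodedata.bidirectional(ch) for ch in component]
    has_rtl = any(c in RTL_CLASSES for c in classes)
    has_ltr = any(c == "L" for c in classes)
    problems = []
    if has_rtl and has_ltr:
        problems.append("mixes right-to-left and left-to-right characters")
    if has_rtl and not (classes[0] in RTL_CLASSES and classes[-1] in RTL_CLASSES):
        problems.append("does not start and end with right-to-left characters")
    return problems
```

Under these assumptions, a Hebrew component with digits in the middle passes, while one that ends in a digit is flagged by the second restriction.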
Note: In some components, the above restrictions may actually be
strictly enforced. For example, [RFC3490] requires that these
restrictions apply to the labels of a host name for those schemes
where ireg-name is a host name. In some other components (for
example, path components) following these restrictions may not be
too difficult. For other components, such as parts of the query
part, it may be very difficult to enforce the restrictions because
the values of query parameters may be arbitrary character
sequences.
If the above restrictions cannot be satisfied otherwise, the affected
component can always be mapped to URI notation as described in
Section 3.3. Please note that the whole component has to be mapped
(see also Example 9 below).
4.3. Input of Bidi IRIs
Bidi input methods MUST generate Bidi IRIs in logical order while
rendering them according to Section 4.1. During input, rendering
SHOULD be updated after every new character is input to avoid end-
user confusion.
4.4. Examples
This section gives examples of bidirectional IRIs, in Bidi Notation.
It shows legal IRIs with the relationship between logical and visual
representation and explains how certain phenomena in this
relationship may look strange to somebody not familiar with
bidirectional behavior, but familiar to users of Arabic and Hebrew.
It also shows what happens if the restrictions given in Section 4.2
are not followed. The examples below can be seen at [BidiEx], in
Arabic, Hebrew, and Bidi Notation variants.
To read the bidi text in the examples, read the visual representation
from left to right until you encounter a block of rtl text. Read the
rtl block (including slashes and other special characters) from right
to left, then continue at the next unread ltr character.
Example 1: A single component with rtl characters is inverted:
Logical representation: "http://ab.CDEFGH.ij/kl/mn/op.html"
Visual representation: "http://ab.HGFEDC.ij/kl/mn/op.html"
Components can be read one by one, and each component can be read in
its natural direction.
Example 2: More than one consecutive component with rtl characters is
inverted as a whole:
Logical representation: "http://ab.CDE.FGH/ij/kl/mn/op.html"
Visual representation: "http://ab.HGF.EDC/ij/kl/mn/op.html"
A sequence of rtl components is read rtl, in the same way as a
sequence of rtl words is read rtl in a bidi text.
Example 3: All components of an IRI (except for the scheme) are rtl.
All rtl components are inverted overall:
Logical representation: "http://AB.CD.EF/GH/IJ/KL?MN=OP;QR=ST#UV"
Visual representation: "http://VU#TS=RQ;PO=NM?LK/JI/HG/FE.DC.BA"
The whole IRI (except the scheme) is read rtl. Delimiters between
rtl components stay between the respective components; delimiters
between ltr and rtl components don't move.
Example 4: Each of several sequences of rtl components is inverted on
its own:
Logical representation: "http://AB.CD.ef/gh/IJ/KL.html"
Visual representation: "http://DC.BA.ef/gh/LK/JI.html"
Each sequence of rtl components is read rtl, in the same way as each
sequence of rtl words in an ltr text is read rtl.
Example 5: Example 2, applied to components of different kinds:
Logical representation: "http://ab.cd.EF/GH/ij/kl.html"
Visual representation: "http://ab.cd.HG/FE/ij/kl.html"
The inversion of the domain name label and the path component may be
unexpected, but it is consistent with other bidi behavior. For
reassurance that the domain component really is "ab.cd.EF", it may be
helpful to read aloud the visual representation following the bidi
algorithm. After "http://ab.cd." one reads the RTL block
"E-F-slash-G-H", which corresponds to the logical representation.
Example 6: Same as Example 5, with more rtl components:
Logical representation: "http://ab.CD.EF/GH/IJ/kl.html"
Visual representation: "http://ab.JI/HG/FE.DC/kl.html"
The inversion of the domain name labels and the path components may
be easier to identify because the delimiters also move.
Example 7: A single rtl component includes digits:
Logical representation: "http://ab.CDE123FGH.ij/kl/mn/op.html"
Visual representation: "http://ab.HGF123EDC.ij/kl/mn/op.html"
Numbers are written ltr in all cases but are treated as an additional
embedding inside a run of rtl characters. This is completely
consistent with usual bidirectional text.
Example 8 (not allowed): Numbers are at the start or end of an rtl
component:
Logical representation: "http://ab.cd.ef/GH1/2IJ/KL.html"
Visual representation: "http://ab.cd.ef/LK/JI1/2HG.html"
The sequence "1/2" is interpreted by the bidi algorithm as a
fraction, fragmenting the components and leading to confusion. There
are other characters that are interpreted in a special way close to
numbers; in particular, "+", "-", "#", "$", "%", ",", ".", and ":".
Example 9 (not allowed): The numbers in the previous example are
percent-encoded:
Logical representation: "http://ab.cd.ef/GH%31/%32IJ/KL.html"
Visual representation: "http://ab.cd.ef/LK/JI%32/%31HG.html"
Example 10 (allowed but not recommended):
Logical representation: "http://ab.CDEFGH.123/kl/mn/op.html"
Visual representation: "http://ab.123.HGFEDC/kl/mn/op.html"
Components consisting of only numbers are allowed (it would be rather
difficult to prohibit them), but these may interact with adjacent RTL
components in ways that are not easy to predict.
Example 11 (allowed but not recommended):
Logical representation: "http://ab.CDEFGH.123ij/kl/mn/op.html"
Visual representation: "http://ab.123.HGFEDCij/kl/mn/op.html"
Components consisting of numbers and left-to-right characters are
allowed, but these may interact with adjacent RTL components in ways
that are not easy to predict.
5. Use of IRIs
5.1. Limitations on UCS Characters Allowed in IRIs 4.1. Limitations on UCS Characters Allowed in IRIs
This section discusses limitations on characters and character This section discusses limitations on characters and character
sequences usable for IRIs beyond those given in Section 2.2 and sequences usable for IRIs beyond those given in Section 2.2. The
Section 4.1. The considerations in this section are relevant when considerations in this section are relevant when IRIs are created and
IRIs are created and when URIs are converted to IRIs. when URIs are converted to IRIs.
a. The repertoire of characters allowed in each IRI component is a. The repertoire of characters allowed in each IRI component is
limited by the definition of that component. For example, the limited by the definition of that component. For example, the
definition of the scheme component does not allow characters definition of the scheme component does not allow characters
beyond US-ASCII. beyond US-ASCII.
(Note: In accordance with URI practice, generic IRI software (Note: In accordance with URI practice, generic IRI software
cannot and should not check for such limitations.) cannot and should not check for such limitations.)
b. The UCS contains many areas of characters for which there are b. The UCS contains many areas of characters for which there are
skipping to change at page 26, line 12 skipping to change at page 20, line 10
the full-width equivalents of Latin characters, half-width the full-width equivalents of Latin characters, half-width
Katakana characters for Japanese, and many others. It also Katakana characters for Japanese, and many others. It also
includes many look-alikes of "space", "delims", and "unwise", includes many look-alikes of "space", "delims", and "unwise",
characters excluded in [RFC3491]. characters excluded in [RFC3491].
Additional information is available from [UNIXML]. [UNIXML] is Additional information is available from [UNIXML]. [UNIXML] is
written in the context of running text rather than in that of written in the context of running text rather than in that of
identifiers. Nevertheless, it discusses many of the categories of identifiers. Nevertheless, it discusses many of the categories of
characters not appropriate for IRIs. characters not appropriate for IRIs.
5.2. Software Interfaces and Protocols 4.2. Software Interfaces and Protocols
Although an IRI is defined as a sequence of characters, software Although an IRI is defined as a sequence of characters, software
interfaces for URIs typically function on sequences of octets or interfaces for URIs typically function on sequences of octets or
other kinds of code units. Thus, software interfaces and protocols other kinds of code units. Thus, software interfaces and protocols
MUST define which character encoding is used. MUST define which character encoding is used.
Intermediate software interfaces between IRI-capable components and Intermediate software interfaces between IRI-capable components and
URI-only components MUST map the IRIs per Section 3.6, when URI-only components MUST map the IRIs per Section 3.6, when
transferring from IRI-capable to URI-only components. This mapping transferring from IRI-capable to URI-only components. This mapping
SHOULD be applied as late as possible. It SHOULD NOT be applied SHOULD be applied as late as possible. It SHOULD NOT be applied
between components that are known to be able to handle IRIs. between components that are known to be able to handle IRIs.
5.3. Format of URIs and IRIs in Documents and Protocols 4.3. Format of URIs and IRIs in Documents and Protocols
Document formats that transport URIs may have to be upgraded to allow Document formats that transport URIs may have to be upgraded to allow
the transport of IRIs. In cases where the document as a whole has a the transport of IRIs. In cases where the document as a whole has a
native character encoding, IRIs MUST also be encoded in this native character encoding, IRIs MUST also be encoded in this
character encoding and converted accordingly by a parser or character encoding and converted accordingly by a parser or
interpreter. IRI characters not expressible in the native character interpreter. IRI characters not expressible in the native character
encoding SHOULD be escaped by using the escaping conventions of the encoding SHOULD be escaped by using the escaping conventions of the
document format if such conventions are available. Alternatively, document format if such conventions are available. Alternatively,
they MAY be percent-encoded according to Section 3.6. For example, they MAY be percent-encoded according to Section 3.6. For example,
in HTML or XML, numeric character references SHOULD be used. If a in HTML or XML, numeric character references SHOULD be used. If a
skipping to change at page 26, line 48 skipping to change at page 20, line 46
the document in the UTF-8 character encoding. the document in the UTF-8 character encoding.
((UPDATE THIS NOTE)) Note: Some formats already accommodate IRIs, ((UPDATE THIS NOTE)) Note: Some formats already accommodate IRIs,
although they use different terminology. HTML 4.0 [HTML4] defines although they use different terminology. HTML 4.0 [HTML4] defines
the conversion from IRIs to URIs as error-avoiding behavior. XML 1.0 the conversion from IRIs to URIs as error-avoiding behavior. XML 1.0
[XML1], XLink [XLink], XML Schema [XMLSchema], and specifications [XML1], XLink [XLink], XML Schema [XMLSchema], and specifications
based upon them allow IRIs. Also, it is expected that all relevant based upon them allow IRIs. Also, it is expected that all relevant
new W3C formats and protocols will be required to handle IRIs new W3C formats and protocols will be required to handle IRIs
[CharMod]. [CharMod].
5.4. Use of UTF-8 for Encoding Original Characters 4.4. Use of UTF-8 for Encoding Original Characters
This section discusses details and gives examples for point c) in This section discusses details and gives examples for point c) in
Section 1.2. To be able to use IRIs, the URI corresponding to the Section 1.2. To be able to use IRIs, the URI corresponding to the
IRI in question has to encode original characters into octets by IRI in question has to encode original characters into octets by
using UTF-8. This can be specified for all URIs of a URI scheme or using UTF-8. This can be specified for all URIs of a URI scheme or
can apply to individual URIs for schemes that do not specify how to can apply to individual URIs for schemes that do not specify how to
encode original characters. It can apply to the whole URI, or only encode original characters. It can apply to the whole URI, or only
to some part. For background information on encoding characters into to some part. For background information on encoding characters into
URIs, see also Section 2.5 of [RFC3986]. URIs, see also Section 2.5 of [RFC3986].
skipping to change at page 28, line 30 skipping to change at page 22, line 27
document name is encoded in iso-8859-1 based on server settings, but document name is encoded in iso-8859-1 based on server settings, but
where the fragment identifier is encoded in UTF-8 according to where the fragment identifier is encoded in UTF-8 according to
[XPointer]. The IRI corresponding to the above URI would be (in XML [XPointer]. The IRI corresponding to the above URI would be (in XML
notation) notation)
"http://www.example.org/r%E9sum%E9.xml#r&#xE9;sum&#xE9;". "http://www.example.org/r%E9sum%E9.xml#r&#xE9;sum&#xE9;".
Similar considerations apply to query parts. The functionality of Similar considerations apply to query parts. The functionality of
IRIs (namely, to be able to include non-ASCII characters) can only be IRIs (namely, to be able to include non-ASCII characters) can only be
used if the query part is encoded in UTF-8. used if the query part is encoded in UTF-8.
5.5. Relative IRI References 4.5. Relative IRI References
Processing of relative IRI references against a base is handled Processing of relative IRI references against a base is handled
straightforwardly; the algorithms of [RFC3986] can be applied straightforwardly; the algorithms of [RFC3986] can be applied
directly, treating the characters additionally allowed in IRI directly, treating the characters additionally allowed in IRI
references in the same way that unreserved characters are in URI references in the same way that unreserved characters are in URI
references. references.
6. Liberal Handling of Otherwise Invalid IRIs 5. Liberal Handling of Otherwise Invalid IRIs
(EDITOR NOTE: This Section may move to an appendix.) Some technical Some technical specifications and widely-deployed software have
specifications and widely-deployed software have allowed additional allowed additional variations and extensions of IRIs to be used in
variations and extensions of IRIs to be used in syntactic components. syntactic components.
This section describes two widely-used preprocessing agreements.
Other technical specifications may wish to reference a syntactic
component which is "a valid IRI or a string that will map to a valid
IRI after this preprocessing algorithm". These two variants are
known as Legacy Extended IRI or LEIRI [LEIRI], and Web Address
[HTML5].
Future technical specifications SHOULD NOT allow conforming producers Future technical specifications SHOULD NOT allow conforming producers
to produce, or conforming content to contain, such forms, as they are to produce, or conforming content to contain, such forms, as they are
not interoperable with other IRI consuming software. not interoperable with other IRI consuming software.
6.1. LEIRI Processing 5.1. LEIRI Processing
This section defines Legacy Extended IRIs (LEIRIs). The syntax of This section defines Legacy Extended IRIs (LEIRIs). The syntax of
Legacy Extended IRIs is the same as that for <IRI-reference>, except Legacy Extended IRIs is the same as that for <IRI-reference>, except
that the ucschar production is replaced by the leiri-ucschar that the ucschar production is replaced by the leiri-ucschar
production: production:
leiri-ucschar = " " / "<" / ">" / '"' / "{" / "}" / "|" leiri-ucschar = " " / "<" / ">" / '"' / "{" / "}" / "|"
/ "\" / "^" / "`" / %x0-1F / %x7F-D7FF / "\" / "^" / "`" / %x0-1F / %x7F-D7FF
/ %xE000-FFFD / %x10000-10FFFF / %xE000-FFFD / %x10000-10FFFF
Among other extensions, processors based on this specification also Among other extensions, processors based on this specification also
did not enforce the restriction on bidirectional formatting did not enforce the restriction on bidirectional formatting
characters in Section 4.1, and the iprivate production becomes characters in [Bidi], and the iprivate production becomes redundant.
redundant.
To convert a string allowed as a LEIRI to an IRI, each character To convert a string allowed as a LEIRI to an IRI, each character
allowed in leiri-ucschar but not in ucschar must be percent-encoded allowed in leiri-ucschar but not in ucschar must be percent-encoded
using Section 3.3. using Section 3.3.
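The conversion rule above can be sketched as follows. This is an illustration under stated assumptions, not normative text: the ucschar ranges are copied from RFC 3987 and assumed unchanged in this draft, "using Section 3.3" is assumed to mean conversion to UTF-8 followed by %HH encoding of each octet, and the function names (is_ucschar, is_leiri_ucschar, leiri_to_iri) are hypothetical.

```python
# Sketch only: ranges for ucschar are taken from RFC 3987 (assumption).
ASCII_EXTRA = set(' <>"{}|\\^`')  # ASCII characters named in leiri-ucschar


def is_ucschar(cp):
    """True if code point cp matches the ucschar production of RFC 3987."""
    if 0xA0 <= cp <= 0xD7FF or 0xF900 <= cp <= 0xFDCF or 0xFDF0 <= cp <= 0xFFEF:
        return True
    if 0xE1000 <= cp <= 0xEFFFD:
        return True
    # Planes 1 to 13, excluding the last two code points of each plane.
    return 0x10000 <= cp <= 0xDFFFD and (cp & 0xFFFF) <= 0xFFFD


def is_leiri_ucschar(ch):
    """True if character ch matches the leiri-ucschar production above."""
    cp = ord(ch)
    return (ch in ASCII_EXTRA or cp <= 0x1F or 0x7F <= cp <= 0xD7FF
            or 0xE000 <= cp <= 0xFFFD or 0x10000 <= cp <= 0x10FFFF)


def percent_encode(ch):
    """Percent-encode one character via its UTF-8 octets (assumed Section 3.3)."""
    return ''.join('%{:02X}'.format(b) for b in ch.encode('utf-8'))


def leiri_to_iri(leiri):
    """Encode every character allowed in leiri-ucschar but not in ucschar."""
    return ''.join(
        percent_encode(ch)
        if is_leiri_ucschar(ch) and not is_ucschar(ord(ch)) else ch
        for ch in leiri)
```

For example, leiri_to_iri('http://example.org/a b<c>') yields 'http://example.org/a%20b%3Cc%3E', while a character such as U+00E9, which is already in ucschar, is left untouched.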
6.2. Web Address Processing 6. Characters Not Allowed in IRIs
Many popular web browsers have taken the approach of being quite
liberal in what is accepted as a "URL" or its relative forms. This
section describes their behavior in terms of a preprocessor which
maps strings into the IRI space for subsequent parsing and
interpretation as an IRI.
In some situations, it might be appropriate to describe the syntax
that a liberal consumer implementation might accept as a "Web
Address" or "Hypertext Reference" or "HREF". However, technical
specifications SHOULD restrict the syntactic form allowed by
compliant producers to the IRI or IRI reference syntax defined in
this document even if they want to mandate this processing.
Summary:
o Leading and trailing whitespace is removed.
o Some additional characters are removed.
o Some additional characters are allowed and escaped (as with
LEIRI).
o If interpreting an IRI as a URI, the pct-encoding of the query
component of the parsed URI component depends on operational
context.
Each string provided may have an associated charset (called the HRef-
charset here); this defaults to UTF-8. For web browsers interpreting
HTML, the document charset of a string is determined as follows:
If the string came from a script (e.g. as an argument to a method):
The HRef-charset is the script's charset.
If the string came from a DOM node (e.g. from an element): The node
has a Document, and the HRef-charset is the Document's character
encoding.
If the string had an HRef-charset defined when the string was created
or defined: The HRef-charset is as defined.
If the resulting HRef-charset is a Unicode-based character encoding
(e.g., UTF-16), then use UTF-8 instead.
The syntax for Web Addresses is obtained by replacing the 'ucschar',
pct-form, path-sep, and ifragment rules with the href-ucschar, href-
pct-form, href-path-sep, and href-ifragment rules below. In
addition, some characters are stripped.
href-ucschar = " " / "<" / ">" / DQUOTE / "{" / "}" / "|"
/ "\" / "^" / "`" / %x0-1F / %x7F-D7FF
/ %xE000-FFFD / %x10000-10FFFF
href-pct-form = pct-encoded / "%"
href-path-sep = "/" / "\"
href-ifragment = *( ipchar / "/" / "?" / "#" ) ; adding "#"
href-strip = <to be done>
(NOTE: NEED TO FIX THESE SETS TO MATCH HTML5; NOT SURE ABOUT NEXT
SENTENCE) browsers did not enforce the restriction on bidirectional
formatting characters in Section 4.1, and the iprivate production
becomes redundant.
'Web Address processing' requires the following additional
preprocessing steps:
1. Leading and trailing instances of space (U+0020), CR (U+000D), LF
(U+000A), and TAB (U+0009) characters are removed.
2. Strip all characters in href-strip.
3. Percent-encode all characters in href-ucschar not in ucschar.
4. Replace occurrences of "%" not followed by two hexadecimal digits
by "%25".
5. Convert backslashes ('\') matching href-path-sep to forward
slashes ('/').
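The five preprocessing steps can be sketched as below. Several points are assumptions of this sketch rather than statements of the draft: href-strip is still undefined ("to be done"), so it is passed in as a parameter that defaults to the empty set; "backslashes matching href-path-sep" is read as backslashes that occur before the first "?" or "#"; and step 5 is applied before step 3, because otherwise step 3 would already have percent-encoded every backslash. The ucschar ranges are taken from RFC 3987, and all function names are hypothetical.

```python
import re

ASCII_EXTRA = set(' <>"{}|\\^`')  # ASCII characters named in href-ucschar
BARE_PERCENT = re.compile(r'%(?![0-9A-Fa-f]{2})')


def is_ucschar(cp):
    """True if code point cp matches ucschar (ranges from RFC 3987)."""
    if 0xA0 <= cp <= 0xD7FF or 0xF900 <= cp <= 0xFDCF or 0xFDF0 <= cp <= 0xFFEF:
        return True
    if 0xE1000 <= cp <= 0xEFFFD:
        return True
    return 0x10000 <= cp <= 0xDFFFD and (cp & 0xFFFF) <= 0xFFFD


def is_href_ucschar(ch):
    """True if character ch matches the href-ucschar production above."""
    cp = ord(ch)
    return (ch in ASCII_EXTRA or cp <= 0x1F or 0x7F <= cp <= 0xD7FF
            or 0xE000 <= cp <= 0xFFFD or 0x10000 <= cp <= 0x10FFFF)


def percent_encode(ch):
    return ''.join('%{:02X}'.format(b) for b in ch.encode('utf-8'))


def preprocess_web_address(s, href_strip=''):
    # Step 1: remove leading and trailing space, CR, LF, and TAB.
    s = s.strip(' \r\n\t')
    # Step 2: strip characters in href-strip (undefined in the draft).
    s = ''.join(ch for ch in s if ch not in href_strip)
    # Step 5, applied early (assumption): backslashes before the first
    # "?" or "#" are treated as path separators.
    m = re.search(r'[?#]', s)
    cut = m.start() if m else len(s)
    s = s[:cut].replace('\\', '/') + s[cut:]
    # Step 3: percent-encode characters in href-ucschar but not in ucschar.
    s = ''.join(
        percent_encode(ch)
        if is_href_ucschar(ch) and not is_ucschar(ord(ch)) else ch
        for ch in s)
    # Step 4: replace "%" not followed by two hexadecimal digits by "%25".
    return BARE_PERCENT.sub('%25', s)
```

With this sketch, the input '  http:\\example.org\a b?x=100%&y=%41' followed by a line feed maps to 'http://example.org/a%20b?x=100%25&y=%41'. The percent-encoded octets produced by step 3 are always followed by two hexadecimal digits, so step 4 does not touch them.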
6.3. Characters Not Allowed in IRIs
This section provides a list of the groups of characters and code This section provides a list of the groups of characters and code
points that are allowed by LEIRI or HREF but are not allowed in IRIs points that are allowed in some contexts but are not allowed in IRIs
or are allowed in IRIs only in the query part. For each group of or are allowed in IRIs only in the query part. For each group of
characters, advice on the usage of these characters is also given, characters, advice on the usage of these characters is also given,
concentrating on the reasons for why they are excluded from IRI use. concentrating on the reasons for why they are excluded from IRI use.
Space (U+0020): Some formats and applications use space as a Space (U+0020): Some formats and applications use space as a
delimiter, e.g. for items in a list. Appendix C of [RFC3986] also delimiter, e.g. for items in a list. Appendix C of [RFC3986] also
mentions that white space may have to be added when displaying or mentions that white space may have to be added when displaying or
printing long URIs; the same applies to long IRIs. This means printing long URIs; the same applies to long IRIs. This means
that spaces can disappear, or can cause what is intended as a that spaces can disappear, or can cause what is intended as a
single IRI or IRI reference to be treated as two or more separate single IRI or IRI reference to be treated as two or more separate
skipping to change at page 34, line 22 skipping to change at page 26, line 29
might allow the user to view an IRI as it is mapped to a URI. Places might allow the user to view an IRI as it is mapped to a URI. Places
where the input of IRIs is frequent may provide the possibility for where the input of IRIs is frequent may provide the possibility for
viewing an IRI as mapped to a URI. This will help users when some of viewing an IRI as mapped to a URI. This will help users when some of
the software they use does not yet accept IRIs. the software they use does not yet accept IRIs.
An IRI input component interfacing to components that handle URIs, An IRI input component interfacing to components that handle URIs,
but not IRIs, must map the IRI to a URI before passing it to these but not IRIs, must map the IRI to a URI before passing it to these
components. components.
For the input of IRIs with right-to-left characters, please see For the input of IRIs with right-to-left characters, please see
Section 4.3. [Bidi].
7.3. URI/IRI Transfer between Applications 7.3. URI/IRI Transfer between Applications
Many applications (for example, mail user agents) try to detect URIs Many applications (for example, mail user agents) try to detect URIs
appearing in plain text. For this, they use some heuristics based on appearing in plain text. For this, they use some heuristics based on
URI syntax. They then allow the user to click on such URIs and URI syntax. They then allow the user to click on such URIs and
retrieve the corresponding resource in an appropriate (usually retrieve the corresponding resource in an appropriate (usually
scheme-dependent) application. scheme-dependent) application.
Such applications would need to be upgraded, in order to use the IRI Such applications would need to be upgraded, in order to use the IRI
skipping to change at page 36, line 24 skipping to change at page 28, line 31
Greek, and Cyrillic, using lowercase letters results in fewer Greek, and Cyrillic, using lowercase letters results in fewer
ambiguities than using uppercase letters would. ambiguities than using uppercase letters would.
7.6. Display of URIs/IRIs 7.6. Display of URIs/IRIs
In situations where the rendering software is not expected to display In situations where the rendering software is not expected to display
non-ASCII parts of the IRI correctly using the available layout and non-ASCII parts of the IRI correctly using the available layout and
font resources, these parts should be percent-encoded before being font resources, these parts should be percent-encoded before being
displayed. displayed.
For display of Bidi IRIs, please see Section 4.1. For display of Bidi IRIs, please see [Bidi].
7.7. Interpretation of URIs and IRIs 7.7. Interpretation of URIs and IRIs
Software that interprets IRIs as the names of local resources should Software that interprets IRIs as the names of local resources should
accept IRIs in multiple forms and convert and match them with the accept IRIs in multiple forms and convert and match them with the
appropriate local resource names. appropriate local resource names.
First, multiple representations include both IRIs in the native First, multiple representations include both IRIs in the native
character encoding of the protocol and also their URI counterparts. character encoding of the protocol and also their URI counterparts.
skipping to change at page 38, line 7 skipping to change at page 30, line 15
encoding for file names will make the transition to IRIs easier. encoding for file names will make the transition to IRIs easier.
Likewise, when a new Web form is set up using UTF-8 as the character Likewise, when a new Web form is set up using UTF-8 as the character
encoding of the form page, the returned query URIs will use UTF-8 as encoding of the form page, the returned query URIs will use UTF-8 as
the character encoding (unless the user, for whatever reason, changes the character encoding (unless the user, for whatever reason, changes
the character encoding) and will therefore be compatible with IRIs. the character encoding) and will therefore be compatible with IRIs.
These recommendations, when taken together, will allow for the These recommendations, when taken together, will allow for the
extension from URIs to IRIs in order to handle characters other than extension from URIs to IRIs in order to handle characters other than
US-ASCII while minimizing interoperability problems. For US-ASCII while minimizing interoperability problems. For
considerations regarding the upgrade of URI scheme definitions, see considerations regarding the upgrade of URI scheme definitions, see
Section 5.4. Section 4.4.
8. IANA Considerations 8. IANA Considerations
RFC Editor and IANA note: Please Replace RFC XXXX with the number of RFC Editor and IANA note: Please Replace RFC XXXX with the number of
this document when it issues as an RFC. this document when it issues as an RFC.
IANA maintains a registry of "URI schemes". A "URI scheme" also IANA maintains a registry of "URI schemes". A "URI scheme" also
serves as an "IRI scheme". serves as an "IRI scheme".
To clarify that the URI scheme registration process also applies to To clarify that the URI scheme registration process also applies to
skipping to change at page 39, line 8 skipping to change at page 31, line 16
User agents SHOULD NOT rely on visual or perceptual comparison or User agents SHOULD NOT rely on visual or perceptual comparison or
verification of IRIs as a means of validating or assuring safety, verification of IRIs as a means of validating or assuring safety,
correctness or appropriateness of an IRI. Other means of presenting correctness or appropriateness of an IRI. Other means of presenting
users with the validity, safety, or appropriateness of visited sites users with the validity, safety, or appropriateness of visited sites
are being developed in the browser community as an alternative means are being developed in the browser community as an alternative means
of avoiding these difficulties. of avoiding these difficulties.
Besides the large character repertoire of Unicode, reasons for Besides the large character repertoire of Unicode, reasons for
confusion include different forms of normalization and different confusion include different forms of normalization and different
normalization expectations, use of percent-encoding with various normalization expectations, use of percent-encoding with various
legacy encodings, and bidirectionality issues. See also [UTR36]. legacy encodings, and bidirectionality issues. See also [Bidi].
Confusion can occur in various IRI components, such as the domain Confusion can occur in various IRI components, such as the domain
name part or the path part, or between IRI components. For name part or the path part, or between IRI components. For
considerations specific to the domain name part, see [RFC5890]. For considerations specific to the domain name part, see [RFC5890]. For
considerations specific to particular protocols or schemes, see the considerations specific to particular protocols or schemes, see the
security sections of the relevant specifications and registration security sections of the relevant specifications and registration
templates. Administrators of sites that allow independent users to templates. Administrators of sites that allow independent users to
create resources in the same sub area have to be careful. Details create resources in the same sub area have to be careful. Details
are discussed in Section 7.5. are discussed in Section 7.5.
Confusion can occur with bidirectional IRIs, if the restrictions in
Section 4.2 are not followed. The same visual representation may be
interpreted as different logical representations, and vice versa. It
is also very important that a correct Unicode bidirectional
implementation be used.
The characters additionally allowed in Legacy Extended IRIs introduce The characters additionally allowed in Legacy Extended IRIs introduce
additional security issues. For details, see Section 6.3. additional security issues. For details, see Section 6.
10. Acknowledgements 10. Acknowledgements
This document was derived from [RFC3987]; the acknowledgments from This document was derived from [RFC3987]; the acknowledgments from
that specification still apply. that specification still apply.
We would like to thank Ian Hickson, Michael Sperberg-McQueen, and Dan
Connolly for their work on HyperText References, and Norman Walsh,
Richard Tobin, Henry S. Thompson, John Cowan, Paul Grosso, and the XML
Core Working Group of the W3C for their work on LEIRIs.
In addition, this document was influenced by contributions from (in In addition, this document was influenced by contributions from (in
no particular order) Chris Lilley, Bjoern Hoehrmann, Felix Sasaki, no particular order) Norman Walsh, Richard Tobin, Henry S. Thompson,
Jeremy Carroll, Frank Ellermann, Michael Everson, Cary Karp, John Cowan, Paul Grosso, the XML Core Working Group of the W3C, Chris
Matitiahu Allouche, Richard Ishida, Addison Phillips, Jonathan Lilley, Bjoern Hoehrmann, Felix Sasaki, Jeremy Carroll, Frank
Rosenne, Najib Tounsi, Debbie Garside, Mark Davis, Sarmad Hussain, Ellermann, Michael Everson, Cary Karp, Matitiahu Allouche, Richard
Ted Hardie, Konrad Lanz, Thomas Roessler, Lisa Dusseault, Julian Ishida, Addison Phillips, Jonathan Rosenne, Najib Tounsi, Debbie
Reschke, Giovanni Campagna, Anne van Kesteren, Mark Nottingham, Erik Garside, Mark Davis, Sarmad Hussain, Ted Hardie, Konrad Lanz, Thomas
van der Poel, Marcin Hanclik, Marcos Caceres, Roy Fielding, Greg Roessler, Lisa Dusseault, Julian Reschke, Giovanni Campagna, Anne van
Wilkins, Pieter Hintjens, Daniel R. Tobias, Marko Martin, Maciej Kesteren, Mark Nottingham, Erik van der Poel, Marcin Hanclik, Marcos
Stanchowiak, Wil Tan, Yui Naruse, Michael A. Puls II, Dave Thaler, Caceres, Roy Fielding, Greg Wilkins, Pieter Hintjens, Daniel R.
Tom Petch, John Klensin, Shawn Steele, Peter Saint-Andre, Geoffrey Tobias, Marko Martin, Maciej Stanchowiak, Wil Tan, Yui Naruse,
Sneddon, Chris Weber, Alex Melnikov, Slim Amamou, S. Moonesamy, Tim Michael A. Puls II, Dave Thaler, Tom Petch, John Klensin, Shawn
Berners-Lee, Yaron Goland, Sam Ruby, Adam Barth, Abdulrahman I. Steele, Peter Saint-Andre, Geoffrey Sneddon, Chris Weber, Alex
ALGhadir, Aharon Lanin, Thomas Milo, Murray Sargent, Marc Blanchet, Melnikov, Slim Amamou, S. Moonesamy, Tim Berners-Lee, Yaron Goland,
and Mykyta Yevstifeyev. Sam Ruby, Adam Barth, Abdulrahman I. ALGhadir, Aharon Lanin, Thomas
Milo, Murray Sargent, Marc Blanchet, and Mykyta Yevstifeyev.
11. Main Changes Since RFC 3987 11. Main Changes Since RFC 3987
This section describes the main changes since [RFC3987]. This section describes the main changes since [RFC3987].
11.1. Major restructuring of IRI processing model 11.1. Split out Bidi, processing guidelines, comparison sections
Move some components (comparison, bidi, processing) into separate
documents.
11.2. Major restructuring of IRI processing model
Major restructuring of IRI processing model to make scheme-specific Major restructuring of IRI processing model to make scheme-specific
translation necessary to handle IDNA requirements and for consistency translation necessary to handle IDNA requirements and for consistency
with web implementations. with web implementations.
Starting with IRI, you want one of: Starting with IRI, you want one of:
a IRI components (IRI parsed into UTF8 pieces) a IRI components (IRI parsed into UTF8 pieces)
b URI components (URI parsed into ASCII pieces, encoded correctly) b URI components (URI parsed into ASCII pieces, encoded correctly)
c whole URI (for passing on to some other system that wants whole c whole URI (for passing on to some other system that wants whole
URIs) URIs)
11.1.1. OLD WAY 11.2.1. OLD WAY
1. Pct-encoding on the whole thing to a URI. (c1) If you want a 1. Pct-encoding on the whole thing to a URI. (c1) If you want a
(maybe broken) whole URI, you might stop here. (maybe broken) whole URI, you might stop here.
2. Parsing the URI into URI components. (b1) If you want (maybe 2. Parsing the URI into URI components. (b1) If you want (maybe
broken) URI components, stop here. broken) URI components, stop here.
3. Decode the components (undoing the pct-encoding). (a) if you want 3. Decode the components (undoing the pct-encoding). (a) if you want
IRI components, stop here. IRI components, stop here.
4. Reencode: either using a different encoding for some components (for 4. Reencode: either using a different encoding for some components (for
domain names, and query components in web pages, which depends on domain names, and query components in web pages, which depends on
the component, scheme and context), and otherwise using pct- the component, scheme and context), and otherwise using pct-
encoding. (b2) if you want (good) URI components, stop here. encoding. (b2) if you want (good) URI components, stop here.
5. reassemble the reencoded components. (c2) if you want a (*good*) 5. reassemble the reencoded components. (c2) if you want a (*good*)
whole URI stop here. whole URI stop here.
11.1.2. NEW WAY 11.2.2. NEW WAY
1. Parse the IRI into IRI components using the generic syntax. (a) 1. Parse the IRI into IRI components using the generic syntax. (a)
if you want IRI components, stop here. if you want IRI components, stop here.
2. Encode each component, using pct-encoding, IDN encoding, or 2. Encode each component, using pct-encoding, IDN encoding, or
special query part encoding depending on the component scheme or special query part encoding depending on the component scheme or
context. (b) If you want URI components, stop here. context. (b) If you want URI components, stop here.
3. Reassemble a whole URI from URI components. (c) If you want a 3. Reassemble a whole URI from URI components. (c) If you want a
whole URI stop here. whole URI stop here.
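The three steps of the new processing model can be sketched with the Python standard library. This is a simplified illustration under several assumptions: the built-in 'idna' codec implements IDNA 2003 and stands in for the IDN encoding referenced by this draft; the query is encoded as UTF-8 rather than with a context-dependent charset; userinfo and IPv6 literals are not handled; and the function name iri_to_uri is hypothetical.

```python
from urllib.parse import quote, urlsplit, urlunsplit

# Characters left untouched in each component; "%" is kept so that
# existing percent-encodings are not encoded a second time.
PATH_SAFE = "/:@!$&'()*+,;=%"
QUERY_SAFE = "/?:@!$&'()*+,;=%"


def iri_to_uri(iri):
    # Step 1: parse the IRI into components using the generic syntax.
    parts = urlsplit(iri)

    # Step 2: encode each component according to its kind.
    host = parts.hostname or ''
    if host:
        # Stand-in for IDN encoding (stdlib codec, IDNA 2003).
        host = host.encode('idna').decode('ascii')
    netloc = host
    if parts.port is not None:
        netloc += ':{}'.format(parts.port)
    path = quote(parts.path, safe=PATH_SAFE)
    query = quote(parts.query, safe=QUERY_SAFE)
    fragment = quote(parts.fragment, safe=QUERY_SAFE)

    # Step 3: reassemble a whole URI from the URI components.
    return urlunsplit((parts.scheme, netloc, path, query, fragment))
```

Stopping after step 1 gives the IRI components (a), the local variables after step 2 are the URI components (b), and the return value is the whole URI (c). The point of the ordering is that the host is never percent-encoded first and then decoded again, which is what the old model required.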
11.1.3. Extension of Syntax 11.2.3. Extension of Syntax
Added the tag range (U+E0000-E0FFF) to the iprivate production. Some Added the tag range (U+E0000-E0FFF) to the iprivate production. Some
IRIs generated with the new syntax may fail to pass very strict IRIs generated with the new syntax may fail to pass very strict
checks relying on the old syntax. But characters in this range checks relying on the old syntax. But characters in this range
should be extremely infrequent anyway. should be extremely infrequent anyway.
11.1.4. More to be added 11.2.4. More to be added
TODO: There are more main changes that need to be documented in this TODO: There are more main changes that need to be documented in this
section. section.
11.2. Change Log 11.3. Change Log
Note to RFC Editor: Please completely remove this section before Note to RFC Editor: Please completely remove this section before
publication. publication.
11.2.1. Changes after draft-ietf-iri-3987bis-01 11.3.1. Changes after draft-ietf-iri-3987bis-01
Changes from draft-ietf-iri-3987bis-01 onwards are available as Changes from draft-ietf-iri-3987bis-01 onwards are available as
changesets in the IETF tools subversion repository at http:// changesets in the IETF tools subversion repository at http://
trac.tools.ietf.org/wg/iri/trac/log/draft-ietf-iri-3987bis/ trac.tools.ietf.org/wg/iri/trac/log/draft-ietf-iri-3987bis/
draft-ietf-iri-3987bis.xml. draft-ietf-iri-3987bis.xml.
11.2.2. Changes from draft-duerst-iri-bis-07 to 11.3.2. Changes from draft-duerst-iri-bis-07 to
draft-ietf-iri-3987bis-00 draft-ietf-iri-3987bis-00
Changed draft name, date, last paragraph of abstract, and titles in Changed draft name, date, last paragraph of abstract, and titles in
change log, and added this section in moving from change log, and added this section in moving from
draft-duerst-iri-bis-07 (personal submission) to draft-duerst-iri-bis-07 (personal submission) to
draft-ietf-iri-3987bis-00 (WG document). draft-ietf-iri-3987bis-00 (WG document).
11.2.3. Changes from -06 to -07 of draft-duerst-iri-bis 11.3.3. Changes from -06 to -07 of draft-duerst-iri-bis
Major restructuring of the processing model, see Section 11.1. Major restructuring of the processing model, see Section 11.2.
11.3. Changes from -00 to -01 11.4. Changes from -00 to -01
o Removed 'mailto:' before mail addresses of authors. o Removed 'mailto:' before mail addresses of authors.
o Added "<to be done>" as right side of 'href-strip' rule. Fixed o Added "<to be done>" as right side of 'href-strip' rule. Fixed
'|' to '/' for alternatives. '|' to '/' for alternatives.
11.4. Changes from -05 to -06 of draft-duerst-iri-bis-00 11.5. Changes from -05 to -06 of draft-duerst-iri-bis-00
o Add HyperText Reference, change abstract, acks and references for o Add HyperText Reference, change abstract, acks and references for
it it
o Add Masinter back as another editor. o Add Masinter back as another editor.
o Masinter integrates HRef material from HTML5 spec. o Masinter integrates HRef material from HTML5 spec.
o Rewrite introduction sections to modernize. o Rewrite introduction sections to modernize.
11.5. Changes from -04 to -05 of draft-duerst-iri-bis 11.6. Changes from -04 to -05 of draft-duerst-iri-bis
o Updated references. o Updated references.
o Changed IPR text to pre5378Trust200902. o Changed IPR text to pre5378Trust200902.
11.6. Changes from -03 to -04 of draft-duerst-iri-bis 11.7. Changes from -03 to -04 of draft-duerst-iri-bis
o Added explicit abbreviation for LEIRIs. o Added explicit abbreviation for LEIRIs.
o Mentioned LEIRI references. o Mentioned LEIRI references.
o Completed text in LEIRI section about tag characters and about o Completed text in LEIRI section about tag characters and about
specials. specials.
11.7. Changes from -02 to -03 of draft-duerst-iri-bis 11.8. Changes from -02 to -03 of draft-duerst-iri-bis
o Updated some references. o Updated some references.
o Updated Michel Suignard's coordinates. o Updated Michel Suignard's coordinates.
11.8. Changes from -01 to -02 of draft-duerst-iri-bis 11.9. Changes from -01 to -02 of draft-duerst-iri-bis
o Added tag range to iprivate (issue private-include-tags-115). o Added tag range to iprivate (issue private-include-tags-115).
o Added Specials (U+FFF0-FFFD) to Legacy Extended IRIs. o Added Specials (U+FFF0-FFFD) to Legacy Extended IRIs.
11.9. Changes from -00 to -01 of draft-duerst-iri-bis 11.10. Changes from -00 to -01 of draft-duerst-iri-bis
o Changed from "IRIs with Spaces/Controls" to "Legacy Extended IRI" o Changed from "IRIs with Spaces/Controls" to "Legacy Extended IRI"
based on input from the W3C XML Core WG. Moved the relevant based on input from the W3C XML Core WG. Moved the relevant
subsections to the back and promoted them to a section. subsections to the back and promoted them to a section.
o Added some text re. Legacy Extended IRIs to the security section. o Added some text re. Legacy Extended IRIs to the security section.
o Added an IANA Consideration Section. o Added an IANA Consideration Section.
o Added this Change Log Section. o Added this Change Log Section.
o Added a section about "IRIs with Spaces/Controls" (converting from o Added a section about "IRIs with Spaces/Controls" (converting from
a Note in RFC 3987). a Note in RFC 3987).
11.10. Changes from RFC 3987 to -00 of draft-duerst-iri-bis 11.11. Changes from RFC 3987 to -00 of draft-duerst-iri-bis
Fixed errata (see Fixed errata (see
http://www.rfc-editor.org/cgi-bin/errataSearch.pl?rfc=3987). http://www.rfc-editor.org/cgi-bin/errataSearch.pl?rfc=3987).
12. References 12. References
12.1. Normative References 12.1. Normative References
[ASCII] American National Standards Institute, "Coded Character [ASCII] American National Standards Institute, "Coded Character
Set -- 7-bit American Standard Code for Information Set -- 7-bit American Standard Code for Information
skipping to change at page 44, line 9 skipping to change at page 36, line 13
[RFC5890] Klensin, J., "Internationalized Domain Names for [RFC5890] Klensin, J., "Internationalized Domain Names for
Applications (IDNA): Definitions and Document Framework", Applications (IDNA): Definitions and Document Framework",
RFC 5890, August 2010. RFC 5890, August 2010.
[RFC5891] Klensin, J., "Internationalized Domain Names in [RFC5891] Klensin, J., "Internationalized Domain Names in
Applications (IDNA): Protocol", RFC 5891, August 2010. Applications (IDNA): Protocol", RFC 5891, August 2010.
[STD68] Crocker, D. and P. Overell, "Augmented BNF for Syntax [STD68] Crocker, D. and P. Overell, "Augmented BNF for Syntax
Specifications: ABNF", STD 68, RFC 5234, January 2008. Specifications: ABNF", STD 68, RFC 5234, January 2008.
[UNI9] Davis, M., "The Bidirectional Algorithm", Unicode Standard
Annex #9, March 2004,
<http://www.unicode.org/reports/tr9/tr9-13.html>.
[UNIV6] The Unicode Consortium, "The Unicode Standard, Version [UNIV6] The Unicode Consortium, "The Unicode Standard, Version
6.0.0 (Mountain View, CA, The Unicode Consortium, 2011, 6.0.0 (Mountain View, CA, The Unicode Consortium, 2011,
ISBN 978-1-936213-01-6)", October 2010. ISBN 978-1-936213-01-6)", October 2010.
[UTR15] Davis, M. and M. Duerst, "Unicode Normalization Forms", [UTR15] Davis, M. and M. Duerst, "Unicode Normalization Forms",
Unicode Standard Annex #15, March 2008, Unicode Standard Annex #15, March 2008,
<http://www.unicode.org/unicode/reports/tr15/ <http://www.unicode.org/unicode/reports/tr15/
tr15-23.html>. tr15-23.html>.
12.2. Informative References 12.2. Informative References
[BidiEx] "Examples of bidirectional IRIs", [Bidi] Duerst, M. and L. Masinter, "Guidelines for
<http://www.w3.org/International/iri-edit/BidiExamples>. Internationalized Resource Identifiers with Bi-directional
Characters (Bidi IRIs)", draft-ietf-iri-bidi-guidelines-00
(work in progress), August 2011.
[CharMod] Duerst, M., Yergeau, F., Ishida, R., Wolf, M., and T. [CharMod] Duerst, M., Yergeau, F., Ishida, R., Wolf, M., and T.
Texin, "Character Model for the World Wide Web: Resource Texin, "Character Model for the World Wide Web: Resource
Identifiers", World Wide Web Consortium Candidate Identifiers", World Wide Web Consortium Candidate
Recommendation, November 2004, Recommendation, November 2004,
<http://www.w3.org/TR/charmod-resid>. <http://www.w3.org/TR/charmod-resid>.
[Duerst97] [Duerst97]
Duerst, M., "The Properties and Promises of UTF-8", Proc. Duerst, M., "The Properties and Promises of UTF-8", Proc.
11th International Unicode Conference, San Jose , 11th International Unicode Conference, San Jose ,
September 1997, <http://www.ifi.unizh.ch/mml/mduerst/ September 1997, <http://www.ifi.unizh.ch/mml/mduerst/
papers/PDF/IUC11-UTF-8.pdf>. papers/PDF/IUC11-UTF-8.pdf>.
[Equivalence]
Masinter, L. and M. Duerst, "Equivalence and
Canonicalization of Internationalized Resource Identifiers
(IRIs)", draft-ietf-iri-comparison-00 (work in progress),
August 2011.
[Gettys] Gettys, J., "URI Model Consequences", [Gettys] Gettys, J., "URI Model Consequences",
<http://www.w3.org/DesignIssues/ModelConsequences>. <http://www.w3.org/DesignIssues/ModelConsequences>.
[HTML4] Raggett, D., Le Hors, A., and I. Jacobs, "HTML 4.01 [HTML4] Raggett, D., Le Hors, A., and I. Jacobs, "HTML 4.01
Specification", World Wide Web Consortium Recommendation, Specification", World Wide Web Consortium Recommendation,
December 1999, December 1999,
<http://www.w3.org/TR/html401/appendix/notes.html#h-B.2>. <http://www.w3.org/TR/html401/appendix/notes.html#h-B.2>.
[HTML5] Hickson, I. and D. Hyatt, "A vocabulary and associated
APIs for HTML and XHTML", World Wide Web
Consortium Working Draft, April 2009,
<http://www.w3.org/TR/2009/WD-html5-20090423/>.
[LEIRI] Thompson, H., Tobin, R., and N. Walsh, "Legacy extended [LEIRI] Thompson, H., Tobin, R., and N. Walsh, "Legacy extended
IRIs for XML resource identification", World Wide Web IRIs for XML resource identification", World Wide Web
Consortium Note, November 2008, Consortium Note, November 2008,
<http://www.w3.org/TR/leiri/>. <http://www.w3.org/TR/leiri/>.
[RFC1738] Berners-Lee, T., Masinter, L., and M. McCahill, "Uniform
Resource Locators (URL)", RFC 1738, December 1994.
[RFC2045] Freed, N. and N. Borenstein, "Multipurpose Internet Mail [RFC2045] Freed, N. and N. Borenstein, "Multipurpose Internet Mail
Extensions (MIME) Part One: Format of Internet Message Extensions (MIME) Part One: Format of Internet Message
Bodies", RFC 2045, November 1996. Bodies", RFC 2045, November 1996.
[RFC2130] Weider, C., Preston, C., Simonsen, K., Alvestrand, H., [RFC2130] Weider, C., Preston, C., Simonsen, K., Alvestrand, H.,
Atkinson, R., Crispin, M., and P. Svanberg, "The Report of Atkinson, R., Crispin, M., and P. Svanberg, "The Report of
the IAB Character Set Workshop held 29 February - 1 March, the IAB Character Set Workshop held 29 February - 1 March,
1996", RFC 2130, April 1997. 1996", RFC 2130, April 1997.
[RFC2141] Moats, R., "URN Syntax", RFC 2141, May 1997. [RFC2141] Moats, R., "URN Syntax", RFC 2141, May 1997.
skipping to change at page 46, line 5 skipping to change at page 38, line 5
[RFC2640] Curtin, B., "Internationalization of the File Transfer [RFC2640] Curtin, B., "Internationalization of the File Transfer
Protocol", RFC 2640, July 1999. Protocol", RFC 2640, July 1999.
[RFC3987] Duerst, M. and M. Suignard, "Internationalized Resource [RFC3987] Duerst, M. and M. Suignard, "Internationalized Resource
Identifiers (IRIs)", RFC 3987, January 2005. Identifiers (IRIs)", RFC 3987, January 2005.
[RFC4395bis] [RFC4395bis]
Hansen, T., Hardie, T., and L. Masinter, "Guidelines and Hansen, T., Hardie, T., and L. Masinter, "Guidelines and
Registration Procedures for New URI/IRI Schemes", Registration Procedures for New URI/IRI Schemes",
draft-hansen-iri-4395bis-irireg-00 (work in progress), draft-ietf-iri-4395bis-irireg-03 (work in progress),
September 2010. July 2011.
[RFC6055] Thaler, D., Klensin, J., and S. Cheshire, "IAB Thoughts on [RFC6055] Thaler, D., Klensin, J., and S. Cheshire, "IAB Thoughts on
Encodings for Internationalized Domain Names", RFC 6055, Encodings for Internationalized Domain Names", RFC 6055,
February 2011. February 2011.
[UNIXML] Duerst, M. and A. Freytag, "Unicode in XML and other [UNIXML] Duerst, M. and A. Freytag, "Unicode in XML and other
Markup Languages", Unicode Technical Report #20, World Markup Languages", Unicode Technical Report #20, World
Wide Web Consortium Note, June 2003, Wide Web Consortium Note, June 2003,
<http://www.w3.org/TR/unicode-xml/>. <http://www.w3.org/TR/unicode-xml/>.
 End of changes. 68 change blocks. 
552 lines changed or deleted 185 lines changed or added
