draft-ietf-iri-3987bis-10.txt   draft-ietf-iri-3987bis-11.txt 
Internationalized Resource Identifiers M. Duerst Internationalized Resource Identifiers M. Duerst
(iri) Aoyama Gakuin University (iri) Aoyama Gakuin University
Internet-Draft M. Suignard Internet-Draft M. Suignard
Obsoletes: 3987 (if approved) Unicode Consortium Obsoletes: 3987 (if approved) Unicode Consortium
Intended status: Standards Track L. Masinter Intended status: Standards Track L. Masinter
Expires: September 3, 2012 Adobe Expires: September 13, 2012 Adobe
March 2, 2012 March 12, 2012
Internationalized Resource Identifiers (IRIs) Internationalized Resource Identifiers (IRIs)
draft-ietf-iri-3987bis-10 draft-ietf-iri-3987bis-11
Abstract Abstract
This document defines the Internationalized Resource Identifier (IRI) This document defines the Internationalized Resource Identifier (IRI)
protocol element, as an extension of the Uniform Resource Identifier protocol element, as an extension of the Uniform Resource Identifier
(URI). An IRI is a sequence of characters from the Universal (URI). An IRI is a sequence of characters from the Universal
Character Set (Unicode/ISO 10646). Grammar and processing rules are Character Set (Unicode/ISO 10646). Grammar and processing rules are
given for IRIs and related syntactic forms. given for IRIs and related syntactic forms.
Defining IRI as new protocol element (rather than updating or Defining IRI as a new protocol element (rather than updating or
extending the definition of URI) allows independent orderly extending the definition of URI) allows independent orderly
transitions: other protocols and languages that use URIs must transitions: protocols and languages that use URIs must explicitly
explicitly choose to allow IRIs. choose to allow IRIs.
Guidelines are provided for the use and deployment of IRIs and Guidelines are provided for the use and deployment of IRIs and
related protocol elements when revising protocols, formats, and related protocol elements when revising protocols, formats, and
software components that currently deal only with URIs. software components that currently deal only with URIs.
This document is part of a set of documents intended to replace RFC This document is part of a set of documents intended to replace RFC
3987. 3987.
RFC Editor: Please remove the next paragraph before publication. RFC Editor: Please remove the next paragraph before publication.
skipping to change at page 2, line 15 skipping to change at page 2, line 15
Internet-Drafts are working documents of the Internet Engineering Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF). Note that other groups may also distribute Task Force (IETF). Note that other groups may also distribute
working documents as Internet-Drafts. The list of current Internet- working documents as Internet-Drafts. The list of current Internet-
Drafts is at http://datatracker.ietf.org/drafts/current/. Drafts is at http://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress." material or to cite them other than as "work in progress."
This Internet-Draft will expire on September 3, 2012. This Internet-Draft will expire on September 13, 2012.
Copyright Notice Copyright Notice
Copyright (c) 2012 IETF Trust and the persons identified as the Copyright (c) 2012 IETF Trust and the persons identified as the
document authors. All rights reserved. document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents Provisions Relating to IETF Documents
(http://trustee.ietf.org/license-info) in effect on the date of (http://trustee.ietf.org/license-info) in effect on the date of
publication of this document. Please review these documents publication of this document. Please review these documents
skipping to change at page 4, line 26 skipping to change at page 4, line 26
11.5. Changes from -05 to -06 of draft-duerst-iri-bis-00 . . . 35 11.5. Changes from -05 to -06 of draft-duerst-iri-bis-00 . . . 35
11.6. Changes from -04 to -05 of draft-duerst-iri-bis . . . . . 35 11.6. Changes from -04 to -05 of draft-duerst-iri-bis . . . . . 35
11.7. Changes from -03 to -04 of draft-duerst-iri-bis . . . . . 35 11.7. Changes from -03 to -04 of draft-duerst-iri-bis . . . . . 35
11.8. Changes from -02 to -03 of draft-duerst-iri-bis . . . . . 35 11.8. Changes from -02 to -03 of draft-duerst-iri-bis . . . . . 35
11.9. Changes from -01 to -02 of draft-duerst-iri-bis . . . . . 35 11.9. Changes from -01 to -02 of draft-duerst-iri-bis . . . . . 35
11.10. Changes from -00 to -01 of draft-duerst-iri-bis . . . . . 36 11.10. Changes from -00 to -01 of draft-duerst-iri-bis . . . . . 36
11.11. Changes from RFC 3987 to -00 of draft-duerst-iri-bis . . 36 11.11. Changes from RFC 3987 to -00 of draft-duerst-iri-bis . . 36
12. References . . . . . . . . . . . . . . . . . . . . . . . . . . 36 12. References . . . . . . . . . . . . . . . . . . . . . . . . . . 36
12.1. Normative References . . . . . . . . . . . . . . . . . . 36 12.1. Normative References . . . . . . . . . . . . . . . . . . 36
12.2. Informative References . . . . . . . . . . . . . . . . . 37 12.2. Informative References . . . . . . . . . . . . . . . . . 37
Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 40 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 39
1. Introduction 1. Introduction
1.1. Overview and Motivation 1.1. Overview and Motivation
A Uniform Resource Identifier (URI) is defined in [RFC3986] as a A Uniform Resource Identifier (URI) is defined in [RFC3986] as a
sequence of characters chosen from a limited subset of the repertoire sequence of characters chosen from a limited subset of the repertoire
of US-ASCII [ASCII] characters. of US-ASCII [ASCII] characters.
The characters in URIs are frequently used for representing words of The characters in URIs are frequently used for representing words of
natural languages. This usage has many advantages: Such URIs are natural languages. This usage has many advantages: Such URIs are
easier to memorize, easier to interpret, easier to transcribe, easier easier to memorize, easier to interpret, easier to transcribe, easier
to create, and easier to guess. For most languages other than to create, and easier to guess. For most languages other than
English, however, the natural script uses characters other than A - English, however, the natural script uses characters other than A -
Z. For many people, handling Latin characters is as difficult as Z. For many people, handling Latin characters is as difficult as
handling the characters of other scripts is for those who use only handling the characters of other scripts is for those who use only
the Latin alphabet. Many languages with non-Latin scripts are the Latin script. Many languages with non-Latin scripts are
transcribed with Latin letters. These transcriptions are now often transcribed with Latin letters. These transcriptions are now often
used in URIs, but they introduce additional difficulties. used in URIs, but they introduce additional difficulties.
The infrastructure for the appropriate handling of characters from The infrastructure for the appropriate handling of characters from
additional scripts is now widely deployed in operating system and additional scripts is now widely deployed in operating system and
application software. Software that can handle a wide variety of application software. Software that can handle a wide variety of
scripts and languages at the same time is increasingly common. Also, scripts and languages at the same time is increasingly common. Also,
an increasing number of protocols and formats can carry a wide range an increasing number of protocols and formats can carry a wide range
of characters. of characters.
URIs are composed out of a very limited repertoire of characters; URIs are composed out of a very limited repertoire of characters;
this design choice was made to support global transcription([RFC3986] this design choice was made to support global transcription (see
section 1.2.1.). Reliable transition between a URI (as an abstract [RFC3986] section 1.2.1.). Reliable transition between a URI (as an
protocol element composed of a sequence of characters) and a abstract protocol element composed of a sequence of characters) and a
presentation of that URI (written on a napkin, read out loud) and presentation of that URI (written on a napkin, read out loud) and
back is relatively straightforward, because of the limited repertoire back is relatively straightforward, because of the limited repertoire
of characters used. IRIs are designed to satisfy a different set of of characters used. IRIs are designed to satisfy a different set of
use requirements; in particular, to allow IRIs to be written in ways use requirements; in particular, to allow IRIs to be written in ways
that are more meaningful to their users, even at the expense of that are more meaningful to their users, even at the expense of
global transcribability. However, ensuring reliability of the global transcribability. However, ensuring reliability of the
transition between an IRI and its presentation and back is more transition between an IRI and its presentation and back is more
difficult and complex when dealing with the larger set of Unicode difficult and complex when dealing with the larger set of Unicode
characters. For example, Unicode supports multiple ways of encoding characters. For example, Unicode supports multiple ways of encoding
complex combinations of characters and accents, with multiple complex combinations of characters and accents, with multiple
character sequences that can result in the same presentation. character sequences that can result in the same presentation.
This document defines the protocol element called Internationalized This document defines the protocol element called Internationalized
Resource Identifier (IRI), which allow applications of URIs to be Resource Identifier (IRI), which allows applications of URIs to be
extended to use resource identifiers that have a much wider extended to use resource identifiers that have a much wider
repertoire of characters. It also provides corresponding repertoire of characters. It also provides corresponding
"internationalized" versions of other constructs from [RFC3986], such "internationalized" versions of other constructs from [RFC3986], such
as URI references. The syntax of IRIs is defined in Section 2. as URI references. The syntax of IRIs is defined in Section 2.
Within this document, Section 5 discusses the use of IRIs in Within this document, Section 5 discusses the use of IRIs in
different situations. Section 7 gives additional informative different situations. Section 7 gives additional informative
guidelines. Section 9 discusses IRI-specific security guidelines. Section 9 discusses IRI-specific security
considerations. considerations.
This specification is part of a collection of specifications intended This specification is part of a collection of specifications intended
to replace [RFC3987]. [Bidi] discusses the special case of to replace [RFC3987]. [Bidi] discusses the special case of
bidirectional IRIs using characters from scripts written right-to- bidirectional IRIs, IRIs using characters from scripts written right-
left. [Equivalence] gives guidelines for applications wishing to to-left. [Equivalence] gives guidelines for applications wishing to
determine if two IRIs are equivalent, as well as defining some determine if two IRIs are equivalent, as well as defining some
equivalence methods. [RFC4395bis] updates the URI scheme equivalence methods. [RFC4395bis] updates the URI scheme
registration guidelines and procedures to note that every URI scheme registration guidelines and procedures to note that every URI scheme
is also automatically an IRI scheme and to allow scheme definitions is also automatically an IRI scheme and to allow scheme definitions
to be directly described in terms of Unicode characters. to be directly described in terms of Unicode characters.
1.2. Applicability 1.2. Applicability
IRIs are designed to allow protocols and software that deal with URIs IRIs are designed to allow protocols and software that deal with URIs
to be updated to handle IRIs. Processing of IRIs is accomplished by to be updated to handle IRIs. Processing of IRIs is accomplished by
skipping to change at page 6, line 39 skipping to change at page 6, line 39
reassembling the components. reassembling the components.
Practical use of IRI forms in place of URI forms depends on the Practical use of IRI forms in place of URI forms depends on the
following conditions being met: following conditions being met:
a. A protocol or format element MUST be explicitly designated to be a. A protocol or format element MUST be explicitly designated to be
able to carry IRIs. The intent is to avoid introducing IRIs into able to carry IRIs. The intent is to avoid introducing IRIs into
contexts that are not defined to accept them. For example, XML contexts that are not defined to accept them. For example, XML
schema [XMLSchema] has an explicit type "anyURI" that includes schema [XMLSchema] has an explicit type "anyURI" that includes
IRIs and IRI references. Therefore, IRIs and IRI references can IRIs and IRI references. Therefore, IRIs and IRI references can
be in attributes and elements of type "anyURI". On the other be used in attributes and elements of type "anyURI". On the other
hand, in the [RFC2616] definition of HTTP/1.1, the Request URI is hand, in HTTP/1.1 ([RFC2616]), the Request URI is defined as a
defined as a URI, which means that direct use of IRIs is not URI, which means that direct use of IRIs is not allowed in HTTP
allowed in HTTP requests. requests.
b. The protocol or format carrying the IRIs MUST have a mechanism to b. The protocol or format carrying the IRIs MUST have a mechanism to
represent the wide range of characters used in IRIs, either represent the wide range of characters used in IRIs, either
natively or by some protocol- or format-specific escaping natively or by some protocol- or format-specific escaping
mechanism (for example, numeric character references in [XML1]). mechanism (for example, numeric character references in [XML1]).
c. The URI scheme definition, if it explicitly allows a percent sign c. The URI scheme definition, if it explicitly allows a percent sign
("%") in any syntactic component, SHOULD define the interpretation ("%") in any syntactic component, SHOULD define the interpretation
of sequences of percent-encoded octets (using "%XX" hex octets) as of sequences of percent-encoded octets (using "%XX" hex octets) as
octet from sequences of UTF-8 encoded strings; this is recommended octets from sequences of UTF-8 encoded characters; this is
in the guidelines for registering new schemes, [RFC4395bis]. For recommended in the guidelines for registering new schemes,
example, this is the practice for IMAP URLs [RFC2192], POP URLs [RFC4395bis]. For example, this is the practice for IMAP URLs
[RFC2384] and the URN syntax [RFC2141]. Note that use of [RFC2192], POP URLs [RFC2384] and the URN syntax [RFC2141]. Note
percent-encoding may also be restricted in some situations, for that use of percent-encoding may also be restricted in some
example, URI schemes that disallow percent-encoding might still be situations, for example, URI schemes that disallow percent-
used with a fragment identifier which is percent-encoded (e.g., encoding might still be used with a fragment identifier which is
[XPointer]). See Section 5.4 for further discussion. percent-encoded (e.g., [XPointer]). See Section 5.4 for further
discussion.
1.3. Definitions 1.3. Definitions
The following definitions are used in this document; they follow the The following definitions are used in this document; they follow the
terms in [RFC2130], [RFC2277], and [ISO10646]. terms in [RFC2130], [RFC2277], and [ISO10646].
character: A member of a set of elements used for the organization, character: A member of a set of elements used for the organization,
control, or representation of data. For example, "LATIN CAPITAL control, or representation of data. For example, "LATIN CAPITAL
LETTER A" names a character. LETTER A" names a character.
skipping to change at page 8, line 5 skipping to change at page 8, line 5
ISO/IEC 10646 [ISO10646] and the Unicode Standard [UNIV6]. ISO/IEC 10646 [ISO10646] and the Unicode Standard [UNIV6].
IRI reference: Denotes the common usage of an Internationalized IRI reference: Denotes the common usage of an Internationalized
Resource Identifier. An IRI reference may be absolute or Resource Identifier. An IRI reference may be absolute or
relative. However, the "IRI" that results from such a reference relative. However, the "IRI" that results from such a reference
only includes absolute IRIs; any relative IRI references are only includes absolute IRIs; any relative IRI references are
resolved to their absolute form. Note that in [RFC2396] URIs did resolved to their absolute form. Note that in [RFC2396] URIs did
not include fragment identifiers, but in [RFC3986] fragment not include fragment identifiers, but in [RFC3986] fragment
identifiers are part of URIs. identifiers are part of URIs.
LEIRI (Legacy Extended IRI) processing: This term was used in LEIRI (Legacy Extended IRI): This term is used in various XML
various XML specifications to refer to strings that, although not specifications to refer to strings that, although not valid IRIs,
valid IRIs, were acceptable input to the processing rules in are acceptable input to the processing rules in Section 6.2.
Section 6.2.
running text: Human text (paragraphs, sentences, phrases) with running text: Human text (paragraphs, sentences, phrases) with
syntax according to orthographic conventions of a natural syntax according to orthographic conventions of a natural
language, as opposed to syntax defined for ease of processing by language, as opposed to syntax defined for ease of processing by
machines (e.g., markup, programming languages). machines (e.g., markup, programming languages).
protocol element: Any portion of a message that affects processing protocol element: Any portion of a message that affects processing
of that message by the protocol in question. of that message by the protocol in question.
create (a URI or IRI): With respect to URIs and IRIs, the term is create (a URI or IRI): With respect to URIs and IRIs, the term is
skipping to change at page 8, line 36 skipping to change at page 8, line 35
parsed URI component: When a URI processor parses a URI (following parsed URI component: When a URI processor parses a URI (following
the generic syntax or a scheme-specific syntax), the result is a the generic syntax or a scheme-specific syntax), the result is a
set of parsed URI components, each of which has a type set of parsed URI components, each of which has a type
(corresponding to the syntactic definition) and a sequence of URI (corresponding to the syntactic definition) and a sequence of URI
characters. characters.
parsed IRI component: When an IRI processor parses an IRI directly, parsed IRI component: When an IRI processor parses an IRI directly,
following the general syntax or a scheme-specific syntax, the following the general syntax or a scheme-specific syntax, the
result is a set of parsed IRI components, each of which has a type result is a set of parsed IRI components, each of which has a type
(corresponding to the syntactice definition) and a sequence of IRI (corresponding to the syntactic definition) and a sequence of IRI
characters. (This definition is analogous to "parsed URI characters. (This definition is analogous to "parsed URI
component".) component".)
IRI scheme: A URI scheme may also be known as an "IRI scheme" if the IRI scheme: A URI scheme may also be known as an "IRI scheme" if the
scheme's syntax has been extended to allow non-US-ASCII characters scheme's syntax has been extended to allow non-US-ASCII characters
according to the rules in this document. according to the rules in this document.
1.4. Notation 1.4. Notation
RFCs and Internet Drafts currently do not allow any characters RFCs and Internet Drafts currently do not allow any characters
outside the US-ASCII repertoire. Therefore, this document uses outside the US-ASCII repertoire. Therefore, this document uses
various special notations to denote such characters in examples. various special notations for such characters in examples.
In text, characters outside US-ASCII are sometimes referenced by In text, characters outside US-ASCII are sometimes referenced by
using a prefix of 'U+', followed by four to six hexadecimal digits. using a prefix of 'U+', followed by four to six hexadecimal digits.
To represent characters outside US-ASCII in examples, this document To represent characters outside US-ASCII in examples, this document
uses 'XML Notation'. uses 'XML Notation'.
XML Notation uses a leading '&#x', a trailing ';', and the XML Notation uses a leading '&#x', a trailing ';', and the
hexadecimal number of the character in the UCS in between. For hexadecimal number of the character in the UCS in between. For
example, &#x44F; stands for CYRILLIC SMALL LETTER YA. In this example, &#x44F; stands for CYRILLIC SMALL LETTER YA. In this
skipping to change at page 9, line 47 skipping to change at page 9, line 45
2.1. Summary of IRI Syntax 2.1. Summary of IRI Syntax
The IRI syntax extends the URI syntax in [RFC3986] by extending the The IRI syntax extends the URI syntax in [RFC3986] by extending the
class of unreserved characters, primarily by adding the characters of class of unreserved characters, primarily by adding the characters of
the UCS (Universal Character Set, [ISO10646]) beyond U+007F, subject the UCS (Universal Character Set, [ISO10646]) beyond U+007F, subject
to the limitations given in the syntax rules below and in to the limitations given in the syntax rules below and in
Section 5.1. Section 5.1.
The syntax and use of components and reserved characters is the same The syntax and use of components and reserved characters is the same
as that in [RFC3986]. Each "URI scheme" thus also functions as an as that in [RFC3986]. Each URI scheme thus also functions as an IRI
"IRI scheme", in that scheme-specific parsing rules for URIs of a scheme, in that scheme-specific parsing rules for URIs of a scheme
scheme are be extended to allow parsing of IRIs using the same are extended to allow parsing of IRIs using the same parsing rules.
parsing rules.
All the operations defined in [RFC3986], such as the resolution of All the operations defined in [RFC3986], such as the resolution of
relative references, can be applied to IRIs by IRI-processing relative references, can be applied to IRIs by IRI-processing
software in exactly the same way as they are for URIs by URI- software in exactly the same way as they are for URIs by URI-
processing software. processing software.
Characters outside the US-ASCII repertoire MUST NOT be reserved and Characters outside the US-ASCII repertoire MUST NOT be reserved and
therefore MUST NOT be used for syntactical purposes, such as to therefore MUST NOT be used for syntactical purposes, such as to
delimit components in newly defined schemes. For example, U+00A2, delimit components in newly defined schemes. For example, U+00A2,
CENT SIGN, is not allowed as a delimiter in IRIs, because it is in CENT SIGN, is not allowed as a delimiter in IRIs, because it is in
skipping to change at page 14, line 19 skipping to change at page 14, line 19
characters not allowed in URIs. Previous processing steps will have characters not allowed in URIs. Previous processing steps will have
removed some characters, and the interpretation of reserved removed some characters, and the interpretation of reserved
characters will have already been done (with the syntactic reserved characters will have already been done (with the syntactic reserved
characters outside of the IRI component). This mapping is defined characters outside of the IRI component). This mapping is defined
for all sequences of Unicode characters, whether or not they are for all sequences of Unicode characters, whether or not they are
valid for the component in question. valid for the component in question.
For each character which is not allowed anywhere in a valid URI apply For each character which is not allowed anywhere in a valid URI apply
the following steps. the following steps.
Convert to UTF-8 Convert the character to a sequence of one or more Convert to UTF-8: Convert the character to a sequence of one or more
octets using UTF-8 [RFC3629]. octets using UTF-8 [RFC3629].
Percent encode Convert each octet of this sequence to %HH, where HH Percent encode: Convert each octet of this sequence to %HH, where HH
is the hexadecimal notation of the octet value. The hexadecimal is the hexadecimal notation of the octet value. The hexadecimal
notation SHOULD use uppercase letters. (This is the general URI notation SHOULD use uppercase letters. (This is the general URI
percent-encoding mechanism in Section 2.1 of [RFC3986].) percent-encoding mechanism in Section 2.1 of [RFC3986].)
Note that the mapping is an identity transformation for parsed URI Note that the mapping is an identity transformation for parsed URI
components of valid URIs, and is idempotent: applying the mapping a components of valid URIs, and is idempotent: applying the mapping a
second time will not change anything. second time will not change anything.
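The two steps above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the draft's normative procedure: the names `URI_CHARS` and `map_component` are hypothetical, and the set of characters treated as "allowed somewhere in a valid URI" is taken to be the unreserved, gen-delims, and sub-delims characters of [RFC3986] plus the percent sign.

```python
import string

# Assumption: characters that may appear somewhere in a valid URI are the
# RFC 3986 unreserved, gen-delims, and sub-delims characters, plus "%".
URI_CHARS = frozenset(
    string.ascii_letters + string.digits
    + "-._~"         # remaining unreserved characters
    + ":/?#[]@"      # gen-delims
    + "!$&'()*+,;="  # sub-delims
    + "%"            # introduces an existing percent-encoding
)


def map_component(component: str) -> str:
    """Percent-encode every character that is not allowed in a URI."""
    out = []
    for ch in component:
        if ch in URI_CHARS:
            out.append(ch)
        else:
            # Step 1: convert the character to UTF-8 octets.
            # Step 2: write each octet as %HH with uppercase hex digits.
            out.extend(f"%{octet:02X}" for octet in ch.encode("utf-8"))
    return "".join(out)
```

Because "%" is left untouched, applying `map_component` to its own output returns the same string, which matches the idempotence noted above.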
3.4. Mapping ireg-name 3.4. Mapping ireg-name
skipping to change at page 15, line 30 skipping to change at page 15, line 30
on each dot-separated label, and by using U+002E (FULL STOP) as a on each dot-separated label, and by using U+002E (FULL STOP) as a
label separator. This procedure may fail, but this would mean that label separator. This procedure may fail, but this would mean that
the IRI cannot be resolved. In such cases, if the domain name the IRI cannot be resolved. In such cases, if the domain name
conversion fails, then the entire IRI conversion fails. Processors conversion fails, then the entire IRI conversion fails. Processors
that have no mechanism for signalling a failure MAY instead that have no mechanism for signalling a failure MAY instead
substitute an otherwise invalid host name, although such processing substitute an otherwise invalid host name, although such processing
SHOULD be avoided. SHOULD be avoided.
For example, the IRI For example, the IRI
"http://r&#xE9;sum&#xE9;.example.org" "http://r&#xE9;sum&#xE9;.example.org"
MAY be converted to is converted to
"http://xn--rsum-bad.example.org" "http://xn--rsum-bad.example.org".
.
This conversion for ireg-name will be better able to deal with legacy This conversion for ireg-name will be better able to deal with legacy
infrastructure that cannot handle percent-encoding in domain names. infrastructure that cannot handle percent-encoding in domain names.
3.4.3. Additional Considerations 3.4.3. Additional Considerations
Note: Domain Names may appear in parts of an IRI other than the Note: Domain Names may appear in parts of an IRI other than the
ireg-name part. It is the responsibility of scheme-specific ireg-name part. It is the responsibility of scheme-specific
implementations (if the Internationalized Domain Name is part of implementations (if the Internationalized Domain Name is part of
the scheme syntax) or of server-side implementations (if the the scheme syntax) or of server-side implementations (if the
skipping to change at page 21, line 41 skipping to change at page 21, line 41
Section 1.2. To be able to use IRIs, the URI corresponding to the Section 1.2. To be able to use IRIs, the URI corresponding to the
IRI in question has to encode original characters into octets by IRI in question has to encode original characters into octets by
using UTF-8. This can be specified for all URIs of a URI scheme or using UTF-8. This can be specified for all URIs of a URI scheme or
can apply to individual URIs for schemes that do not specify how to can apply to individual URIs for schemes that do not specify how to
encode original characters. It can apply to the whole URI, or only encode original characters. It can apply to the whole URI, or only
to some part. For background information on encoding characters into to some part. For background information on encoding characters into
URIs, see also Section 2.5 of [RFC3986]. URIs, see also Section 2.5 of [RFC3986].
For new URI schemes, using UTF-8 is recommended in [RFC4395bis]. For new URI schemes, using UTF-8 is recommended in [RFC4395bis].
Examples where UTF-8 is already used are the URN syntax [RFC2141], Examples where UTF-8 is already used are the URN syntax [RFC2141],
IMAP URLs [RFC2192], and POP URLs [RFC2384]. On the other hand, IMAP URLs [RFC2192], POP URLs [RFC2384], XMPP URLs [RFC5122], and the
because the HTTP URI scheme does not specify how to encode original 'mailto:' scheme [RFC6068]. On the other hand, because the HTTP URI
characters, only some HTTP URLs can have corresponding but different scheme does not specify how to encode original characters, only some
IRIs. HTTP URLs can have corresponding but different IRIs.
For example, for a document with a URI of For example, for a document with a URI of
"http://www.example.org/r%C3%A9sum%C3%A9.html", it is possible to "http://www.example.org/r%C3%A9sum%C3%A9.html", it is possible to
construct a corresponding IRI (in XML notation, see Section 1.4): construct a corresponding IRI (in XML notation, see Section 1.4):
"http://www.example.org/r&#xE9;sum&#xE9;.html" ("&#xE9;" stands for "http://www.example.org/r&#xE9;sum&#xE9;.html" ("&#xE9;" stands for
the e-acute character, and "%C3%A9" is the UTF-8 encoded and percent- the e-acute character, and "%C3%A9" is the UTF-8 encoded and percent-
encoded representation of that character). On the other hand, for a encoded representation of that character). On the other hand, for a
document with a URI of "http://www.example.org/r%E9sum%E9.html", the document with a URI of "http://www.example.org/r%E9sum%E9.html", the
percent-encoding octets cannot be converted to actual characters in percent-encoded octets cannot be converted to actual characters in an
an IRI, as the percent-encoding is not based on UTF-8. IRI, as the percent-encoding is not based on UTF-8.
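The difference between the two example URIs can be checked mechanically. The following Python sketch only tests whether the percent-encoded octets of a component form valid UTF-8; it is not a full URI-to-IRI conversion, which has further restrictions not shown here. The function name `percent_octets_are_utf8` is hypothetical.

```python
from urllib.parse import unquote_to_bytes


def percent_octets_are_utf8(component: str) -> bool:
    """Return True if the component's octets decode as valid UTF-8."""
    octets = unquote_to_bytes(component)  # undo %HH escapes, yielding bytes
    try:
        octets.decode("utf-8")  # strict decoding rejects malformed input
    except UnicodeDecodeError:
        return False
    return True


utf8_based = percent_octets_are_utf8("r%C3%A9sum%C3%A9.html")
not_utf8_based = percent_octets_are_utf8("r%E9sum%E9.html")
```

Here "%C3%A9" decodes to the e-acute character, while a lone "%E9" octet followed by ASCII letters is not a valid UTF-8 sequence, so only the first URI has a corresponding but different IRI.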
For most URI schemes, there is no need to upgrade their scheme For most URI schemes, there is no need to upgrade their scheme
definition in order for them to work with IRIs. The main case where definition in order for them to work with IRIs. The main case where
upgrading makes sense is when a scheme definition, or a particular upgrading makes sense is when a scheme definition, or a particular
component of a scheme, is strictly limited to the use of US-ASCII component of a scheme, is strictly limited to the use of US-ASCII
characters with no provision to include non-ASCII characters/octets characters with no provision to include non-ASCII characters/octets
via percent-encoding, or if a scheme definition currently uses highly via percent-encoding, or if a scheme definition currently uses highly
scheme-specific provisions for the encoding of non-ASCII characters. scheme-specific provisions for the encoding of non-ASCII characters.
An example of this is the mailto: scheme [RFC2368].
This specification updates the IANA registry of URI schemes to note This specification updates the IANA registry of URI schemes to note
their applicability to IRIs, see Section 8. All IRIs use URI their applicability to IRIs, see Section 8. All IRIs use URI
schemes, and all URIs with URI schemes can be used as IRIs, even schemes, and all URIs with URI schemes can be used as IRIs, even
though in some cases only by using URIs directly as IRIs, without any though in some cases only by using URIs directly as IRIs, without any
conversion. conversion.
Scheme definitions can impose restrictions on the syntax of scheme- Scheme definitions can impose restrictions on the syntax of scheme-
specific URIs; i.e., URIs that are admissible under the generic URI specific URIs; i.e., URIs that are admissible under the generic URI
syntax [RFC3986] may not be admissible due to narrower syntactic syntax [RFC3986] may not be admissible due to narrower syntactic
skipping to change at page 23, line 14 skipping to change at page 23, line 14
Similar considerations apply to query parts. The functionality of Similar considerations apply to query parts. The functionality of
IRIs (namely, to be able to include non-ASCII characters) can only be IRIs (namely, to be able to include non-ASCII characters) can only be
used if the query part is encoded in UTF-8. used if the query part is encoded in UTF-8.
5.5. Relative IRI References 5.5. Relative IRI References
Processing of relative IRI references against a base is handled Processing of relative IRI references against a base is handled
straightforwardly; the algorithms of [RFC3986] can be applied straightforwardly; the algorithms of [RFC3986] can be applied
directly, treating the characters additionally allowed in IRI directly, treating the characters additionally allowed in IRI
references in the same way that unreserved characters are in URI references in the same way that unreserved characters are treated in
references. URI references.
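As an informal illustration of the paragraph above, the following Python sketch resolves a relative IRI reference with the standard library function urllib.parse.urljoin, which implements the reference resolution of [RFC3986] and passes non-ASCII characters through like any other path character. The base IRI and the relative reference are made-up examples, and the sketch assumes that urljoin is an acceptable stand-in for the [RFC3986] algorithm.

```python
from urllib.parse import urljoin

# Hypothetical base IRI and relative IRI reference, both containing
# non-ASCII characters (written as escapes to keep the source ASCII-only).
base = "http://www.example.org/dir/r\u00e9sum\u00e9.html"
relative = "../\u00e9t\u00e9/page.html"

# The non-ASCII characters are carried along unchanged during
# merging and dot-segment removal.
resolved = urljoin(base, relative)

if __name__ == "__main__":
    print(resolved)
```

No percent-encoding or decoding takes place here; resolution operates on the IRI characters directly.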
6. Legacy Extended IRIs (LEIRIs) 6. Legacy Extended IRIs (LEIRIs)
In some cases, there have been formats which have used a protocol In some cases, there have been formats which have used a protocol
element which is a variant of the IRI definition; these variants have element which is a variant of the IRI definition; these variants have
usually been somewhat less restricted in syntax. This section usually been somewhat less restricted in syntax. This section
provides a definition and a name (Legacy Extended IRI or LEIRI) for provides a definition and a name (Legacy Extended IRI or LEIRI) for
one of these variants used widely in XML-based protocols. This one of these variants used widely in XML-based protocols. This
variant has to be used with care; it requires further processing variant has to be used with care; it requires further processing
before being fully interchangeable as IRIs. New protocols and before being fully interchangeable as IRIs. New protocols and
skipping to change at page 23, line 39 skipping to change at page 23, line 39
The provisions in this section also apply to Legacy Extended IRI The provisions in this section also apply to Legacy Extended IRI
references. references.
6.1. Legacy Extended IRI Syntax 6.1. Legacy Extended IRI Syntax
This section defines Legacy Extended IRIs (LEIRIs). The syntax of This section defines Legacy Extended IRIs (LEIRIs). The syntax of
Legacy Extended IRIs is the same as that for <IRI-reference>, except Legacy Extended IRIs is the same as that for <IRI-reference>, except
that the ucschar production is replaced by the leiri-ucschar that the ucschar production is replaced by the leiri-ucschar
production: production:
leiri-ucschar = " " / "<" / ">" / '"' / "{" / "}" / "|" leiri-ucschar = " " / "<" / ">" / '"' / "{" / "}" / "|"
/ "\" / "^" / "`" / %x0-1F / %x7F-D7FF / "\" / "^" / "`" / %x0-1F / %x7F-D7FF
/ %xE000-FFFD / %x10000-10FFFF / %xE000-FFFD / %x10000-10FFFF
The restriction on bidirectional formatting characters in [Bidi] is The restriction on bidirectional formatting characters in [Bidi] is
lifted. The iprivate production becomes redundant. lifted. The iprivate production becomes redundant.
Likewise, the syntax for Legacy Extended IRI references (LEIRI Likewise, the syntax for Legacy Extended IRI references (LEIRI
references) is the same as that for IRI references with the above references) is the same as that for IRI references with the above
replacement of ucschar with leiri-ucschar. replacement of ucschar with leiri-ucschar.
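The leiri-ucschar production above can be transcribed into a small membership test. The following Python sketch is only an illustration: the function name is_leiri_ucschar is hypothetical, and the code checks a single character against the listed literals and code point ranges, nothing more.

```python
# Literal characters listed explicitly in the leiri-ucschar production.
_LEIRI_LITERALS = ' <>"{}|\\^`'


def is_leiri_ucschar(ch: str) -> bool:
    """Return True if the single character ch matches leiri-ucschar."""
    if ch in _LEIRI_LITERALS:
        return True
    cp = ord(ch)  # raises TypeError if ch is not exactly one character
    return (
        0x00 <= cp <= 0x1F
        or 0x7F <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


if __name__ == "__main__":
    for sample in [" ", "a", "\u00e9", "\ufffe"]:
        print(repr(sample), is_leiri_ucschar(sample))
```

Ordinary ASCII letters such as "a" are not matched, since they are already covered by other productions of the IRI syntax; surrogate code points and U+FFFE/U+FFFF fall outside all ranges.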
6.2. Conversion of Legacy Extended IRIs to IRIs 6.2. Conversion of Legacy Extended IRIs to IRIs
skipping to change at page 32, line 38 skipping to change at page 32, line 38
The characters additionally allowed in Legacy Extended IRIs introduce The characters additionally allowed in Legacy Extended IRIs introduce
additional security issues. For details, see Section 6.3. additional security issues. For details, see Section 6.3.
10. Acknowledgements 10. Acknowledgements
This document was derived from [RFC3987]; the acknowledgments from This document was derived from [RFC3987]; the acknowledgments from
that specification still apply. that specification still apply.
In addition, this document was influenced by contributions from (in In addition, this document was influenced by contributions from (in
no particular order)Norman Walsh, Richard Tobin, Henry S. Thomson, no particular order) Norman Walsh, Richard Tobin, Henry S. Thomson,
John Cowan, Paul Grosso, the XML Core Working Group of the W3C, Chris John Cowan, Paul Grosso, the XML Core Working Group of the W3C, Chris
Lilley, Bjoern Hoehrmann, Felix Sasaki, Jeremy Carroll, Frank Lilley, Bjoern Hoehrmann, Felix Sasaki, Jeremy Carroll, Frank
Ellermann, Michael Everson, Cary Karp, Matitiahu Allouche, Richard Ellermann, Michael Everson, Cary Karp, Matitiahu Allouche, Richard
Ishida, Addison Phillips, Jonathan Rosenne, Najib Tounsi, Debbie Ishida, Addison Phillips, Jonathan Rosenne, Najib Tounsi, Debbie
Garside, Mark Davis, Sarmad Hussain, Ted Hardie, Konrad Lanz, Thomas Garside, Mark Davis, Sarmad Hussain, Ted Hardie, Konrad Lanz, Thomas
Roessler, Lisa Dusseault, Julian Reschke, Giovanni Campagna, Anne van Roessler, Lisa Dusseault, Julian Reschke, Giovanni Campagna, Anne van
Kesteren, Mark Nottingham, Erik van der Poel, Marcin Hanclik, Marcos Kesteren, Mark Nottingham, Erik van der Poel, Marcin Hanclik, Marcos
Caceres, Roy Fielding, Greg Wilkins, Pieter Hintjens, Daniel R. Caceres, Roy Fielding, Greg Wilkins, Pieter Hintjens, Daniel R.
Tobias, Marko Martin, Maciej Stanchowiak, Wil Tan, Yui Naruse, Tobias, Marko Martin, Maciej Stanchowiak, Wil Tan, Yui Naruse,
Michael A. Puls II, Dave Thaler, Tom Petch, John Klensin, Shawn Michael A. Puls II, Dave Thaler, Tom Petch, John Klensin, Shawn
skipping to change at page 37, line 32 skipping to change at page 37, line 32
6.0.0 (Mountain View, CA, The Unicode Consortium, 2011, 6.0.0 (Mountain View, CA, The Unicode Consortium, 2011,
ISBN 978-1-936213-01-6)", October 2010. ISBN 978-1-936213-01-6)", October 2010.
[UTR15] Davis, M. and M. Duerst, "Unicode Normalization Forms", [UTR15] Davis, M. and M. Duerst, "Unicode Normalization Forms",
Unicode Standard Annex #15, March 2008, Unicode Standard Annex #15, March 2008,
<http://www.unicode.org/unicode/reports/tr15/ <http://www.unicode.org/unicode/reports/tr15/
tr15-23.html>. tr15-23.html>.
12.2. Informative References 12.2. Informative References
[Bidi] Duerst, M. and L. Masinter, "Guidelines for [Bidi] Duerst, M., Masinter, L., and A. Allawi, "Guidelines for
Internationalized Resource Identifiers with Bi-directional Internationalized Resource Identifiers with Bi-directional
Characters (Bidi IRIs)", draft-ietf-iri-bidi-guidelines-00 Characters (Bidi IRIs)", draft-ietf-iri-bidi-guidelines-02
(work in progress), August 2011. (work in progress), March 2012.
[CharMod] Duerst, M., Yergeau, F., Ishida, R., Wolf, M., and T. [CharMod] Duerst, M., Yergeau, F., Ishida, R., Wolf, M., and T.
Texin, "Character Model for the World Wide Web: Resource Texin, "Character Model for the World Wide Web: Resource
Identifiers", World Wide Web Consortium Candidate Identifiers", World Wide Web Consortium Candidate
Recommendation, November 2004, Recommendation, November 2004,
<http://www.w3.org/TR/charmod-resid>. <http://www.w3.org/TR/charmod-resid>.
[Duerst97] [Duerst97]
Duerst, M., "The Properties and Promises of UTF-8", Proc. Duerst, M., "The Properties and Promises of UTF-8", Proc.
11th International Unicode Conference, San Jose , 11th International Unicode Conference, San Jose ,
September 1997, <http://www.ifi.unizh.ch/mml/mduerst/ September 1997, <http://www.ifi.unizh.ch/mml/mduerst/
papers/PDF/IUC11-UTF-8.pdf>. papers/PDF/IUC11-UTF-8.pdf>.
[Equivalence] [Equivalence]
Masinter, L. and M. Duerst, "Equivalence and Masinter, L. and M. Duerst, "Equivalence and
Canonicalization of Internationalized Resource Identifiers Canonicalization of Internationalized Resource Identifiers
(IRIs)", draft-ietf-iri-comparison-00 (work in progress), (IRIs)", draft-ietf-iri-comparison-01 (work in progress),
August 2011. March 2012.
[Gettys] Gettys, J., "URI Model Consequences", [Gettys] Gettys, J., "URI Model Consequences",
<http://www.w3.org/DesignIssues/ModelConsequences>. <http://www.w3.org/DesignIssues/ModelConsequences>.
[HTML4] Raggett, D., Le Hors, A., and I. Jacobs, "HTML 4.01 [HTML4] Raggett, D., Le Hors, A., and I. Jacobs, "HTML 4.01
Specification", World Wide Web Consortium Recommendation, Specification", World Wide Web Consortium Recommendation,
December 1999, December 1999,
<http://www.w3.org/TR/html401/appendix/notes.html#h-B.2>. <http://www.w3.org/TR/html401/appendix/notes.html#h-B.2>.
[LEIRI] Thompson, H., Tobin, R., and N. Walsh, "Legacy extended
IRIs for XML resource identification", World Wide Web
Consortium Note, November 2008,
<http://www.w3.org/TR/leiri/>.
[RFC2045] Freed, N. and N. Borenstein, "Multipurpose Internet Mail
Extensions (MIME) Part One: Format of Internet Message
Bodies", RFC 2045, November 1996.
[RFC2130] Weider, C., Preston, C., Simonsen, K., Alvestrand, H., [RFC2130] Weider, C., Preston, C., Simonsen, K., Alvestrand, H.,
Atkinson, R., Crispin, M., and P. Svanberg, "The Report of Atkinson, R., Crispin, M., and P. Svanberg, "The Report of
the IAB Character Set Workshop held 29 February - 1 March, the IAB Character Set Workshop held 29 February - 1 March,
1996", RFC 2130, April 1997. 1996", RFC 2130, April 1997.
[RFC2141] Moats, R., "URN Syntax", RFC 2141, May 1997. [RFC2141] Moats, R., "URN Syntax", RFC 2141, May 1997.
[RFC2192] Newman, C., "IMAP URL Scheme", RFC 2192, September 1997. [RFC2192] Newman, C., "IMAP URL Scheme", RFC 2192, September 1997.
[RFC2277] Alvestrand, H., "IETF Policy on Character Sets and [RFC2277] Alvestrand, H., "IETF Policy on Character Sets and
Languages", BCP 18, RFC 2277, January 1998. Languages", BCP 18, RFC 2277, January 1998.
[RFC2368] Hoffman, P., Masinter, L., and J. Zawinski, "The mailto
URL scheme", RFC 2368, July 1998.
[RFC2384] Gellens, R., "POP URL Scheme", RFC 2384, August 1998. [RFC2384] Gellens, R., "POP URL Scheme", RFC 2384, August 1998.
[RFC2396] Berners-Lee, T., Fielding, R., and L. Masinter, "Uniform [RFC2396] Berners-Lee, T., Fielding, R., and L. Masinter, "Uniform
Resource Identifiers (URI): Generic Syntax", RFC 2396, Resource Identifiers (URI): Generic Syntax", RFC 2396,
August 1998. August 1998.
[RFC2397] Masinter, L., "The "data" URL scheme", RFC 2397, [RFC2397] Masinter, L., "The "data" URL scheme", RFC 2397,
August 1998. August 1998.
[RFC2616] Fielding, R., Gettys, J., Mogul, J., Frystyk, H., [RFC2616] Fielding, R., Gettys, J., Mogul, J., Frystyk, H.,
skipping to change at page 39, line 17 skipping to change at page 39, line 5
[RFC3987] Duerst, M. and M. Suignard, "Internationalized Resource [RFC3987] Duerst, M. and M. Suignard, "Internationalized Resource
Identifiers (IRIs)", RFC 3987, January 2005. Identifiers (IRIs)", RFC 3987, January 2005.
[RFC4395bis] [RFC4395bis]
Hansen, T., Hardie, T., and L. Masinter, "Guidelines and Hansen, T., Hardie, T., and L. Masinter, "Guidelines and
Registration Procedures for New URI/IRI Schemes", Registration Procedures for New URI/IRI Schemes",
draft-ietf-iri-4395bis-irireg-03 (work in progress), draft-ietf-iri-4395bis-irireg-03 (work in progress),
July 2011. July 2011.
[RFC5122] Saint-Andre, P., "Internationalized Resource Identifiers
(IRIs) and Uniform Resource Identifiers (URIs) for the
Extensible Messaging and Presence Protocol (XMPP)",
RFC 5122, February 2008.
[RFC6055] Thaler, D., Klensin, J., and S. Cheshire, "IAB Thoughts on [RFC6055] Thaler, D., Klensin, J., and S. Cheshire, "IAB Thoughts on
Encodings for Internationalized Domain Names", RFC 6055, Encodings for Internationalized Domain Names", RFC 6055,
February 2011. February 2011.
[RFC6082] Whistler, K., Adams, G., Duerst, M., Presuhn, R., and J. [RFC6068] Duerst, M., Masinter, L., and J. Zawinski, "The 'mailto'
Klensin, "Deprecating Unicode Language Tag Characters: RFC URI Scheme", RFC 6068, October 2010.
2482 is Historic", RFC 6082, November 2010.
[UNIXML] Duerst, M. and A. Freytag, "Unicode in XML and other [UNIXML] Duerst, M. and A. Freytag, "Unicode in XML and other
Markup Languages", Unicode Technical Report #20, World Markup Languages", Unicode Technical Report #20, World
Wide Web Consortium Note, June 2003, Wide Web Consortium Note, June 2003,
<http://www.w3.org/TR/unicode-xml/>. <http://www.w3.org/TR/unicode-xml/>.
[UTR36] Davis, M. and M. Suignard, "Unicode Security [UTR36] Davis, M. and M. Suignard, "Unicode Security
Considerations", Unicode Technical Report #36, Considerations", Unicode Technical Report #36,
August 2010, <http://unicode.org/reports/tr36/>. August 2010, <http://unicode.org/reports/tr36/>.
[XLink] DeRose, S., Maler, E., and D. Orchard, "XML Linking [XLink] DeRose, S., Maler, E., and D. Orchard, "XML Linking
Language (XLink) Version 1.0", World Wide Web Language (XLink) Version 1.0", World Wide Web
Consortium REC-xlink-20010627, June 2001, Consortium REC-xlink-20010627, June 2001,
<http://www.w3.org/TR/xlink/#link-locators>. <http://www.w3.org/TR/xlink/#link-locators>.
[XML1] Bray, T., Paoli, J., Sperberg-McQueen, C., Maler, E., and [XML1] Bray, T., Paoli, J., Sperberg-McQueen, C., Maler, E., and
F. Yergeau, "Extensible Markup Language (XML) 1.0 (Forth F. Yergeau, "Extensible Markup Language (XML) 1.0 (Fifth
Edition)", World Wide Web Consortium REC-xml-20081126, Edition)", World Wide Web Consortium REC-xml-20081116,
August 2006, <http://www.w3.org/TR/REC-xml>. November 2008, <http://www.w3.org/TR/REC-xml>.
[XMLNamespace]
Bray, T., Hollander, D., Layman, A., and R. Tobin,
"Namespaces in XML (Second Edition)", World Wide Web
Consortium REC-xml-names-20091208, August 2006,
<http://www.w3.org/TR/REC-xml-names>.
[XMLSchema] [XMLSchema]
Biron, P. and A. Malhotra, "XML Schema Part 2: Datatypes", Biron, P. and A. Malhotra, "XML Schema Part 2: Datatypes",
World Wide Web Consortium REC-xmlschema-2-20041028, World Wide Web Consortium REC-xmlschema-2-20041028,
May 2001, <http://www.w3.org/TR/xmlschema-2/#anyURI>. May 2001, <http://www.w3.org/TR/xmlschema-2/#anyURI>.
[XPointer] [XPointer]
Grosso, P., Maler, E., Marsh, J., and N. Walsh, "XPointer Grosso, P., Maler, E., Marsh, J., and N. Walsh, "XPointer
Framework", World Wide Web Consortium REC-xptr-framework- Framework", World Wide Web Consortium REC-xptr-framework-
20030325, March 2003, 20030325, March 2003,
<http://www.w3.org/TR/xptr-framework/#escaping>. <http://www.w3.org/TR/xptr-framework/#escaping>.
Authors' Addresses Authors' Addresses
Martin Duerst (Note: Please write "Duerst" with u-umlaut wherever Martin Duerst (Note: Please write "Duerst" with u-umlaut wherever
possible, for example as "D&#252;rst" in XML and HTML.) possible, for example as "D&#252;rst" in XML and HTML)
Aoyama Gakuin University Aoyama Gakuin University
5-10-1 Fuchinobe 5-10-1 Fuchinobe
Sagamihara, Kanagawa 229-8558 Sagamihara, Kanagawa 229-8558
Japan Japan
Phone: +81 42 759 6329 Phone: +81 42 759 6329
Fax: +81 42 759 6495 Fax: +81 42 759 6495
Email: duerst@it.aoyama.ac.jp Email: duerst@it.aoyama.ac.jp
URI: http://www.sw.it.aoyama.ac.jp/D%C3%BCrst/ URI: http://www.sw.it.aoyama.ac.jp/D%C3%BCrst/
(Note: This is the percent-encoded form of an IRI.) (Note: This is the percent-encoded form of an IRI)
Michel Suignard Michel Suignard
Unicode Consortium Unicode Consortium
P.O. Box 391476 P.O. Box 391476
Mountain View, CA 94039-1476 Mountain View, CA 94039-1476
U.S.A. U.S.A.
Phone: +1-650-693-3921 Phone: +1-650-693-3921
Email: michel@unicode.org Email: michel@unicode.org
URI: http://www.suignard.com URI: http://www.suignard.com
 End of changes. 35 change blocks. 
86 lines changed or deleted 69 lines changed or added
