draft-ietf-iri-3987bis-09.txt   draft-ietf-iri-3987bis-10.txt 
Internationalized Resource Identifiers M. Duerst Internationalized Resource Identifiers M. Duerst
(iri) Aoyama Gakuin University (iri) Aoyama Gakuin University
Internet-Draft M. Suignard Internet-Draft M. Suignard
Obsoletes: 3987 (if approved) Unicode Consortium Obsoletes: 3987 (if approved) Unicode Consortium
Intended status: Standards Track L. Masinter Intended status: Standards Track L. Masinter
Expires: July 12, 2012 Adobe Expires: September 3, 2012 Adobe
January 9, 2012 March 2, 2012
Internationalized Resource Identifiers (IRIs) Internationalized Resource Identifiers (IRIs)
draft-ietf-iri-3987bis-09 draft-ietf-iri-3987bis-10
Abstract Abstract
This document defines the Internationalized Resource Identifier (IRI) This document defines the Internationalized Resource Identifier (IRI)
protocol element, as an extension of the Uniform Resource Identifier protocol element, as an extension of the Uniform Resource Identifier
(URI). An IRI is a sequence of characters from the Universal (URI). An IRI is a sequence of characters from the Universal
Character Set (Unicode/ISO 10646). Grammar and processing rules are Character Set (Unicode/ISO 10646). Grammar and processing rules are
given for IRIs and related syntactic forms. given for IRIs and related syntactic forms.
Defining IRI as a new protocol element (rather than updating or Defining IRI as a new protocol element (rather than updating or
skipping to change at page 2, line 15 skipping to change at page 2, line 15
Internet-Drafts are working documents of the Internet Engineering Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF). Note that other groups may also distribute Task Force (IETF). Note that other groups may also distribute
working documents as Internet-Drafts. The list of current Internet- working documents as Internet-Drafts. The list of current Internet-
Drafts is at http://datatracker.ietf.org/drafts/current/. Drafts is at http://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress." material or to cite them other than as "work in progress."
This Internet-Draft will expire on July 12, 2012. This Internet-Draft will expire on September 3, 2012.
Copyright Notice Copyright Notice
Copyright (c) 2012 IETF Trust and the persons identified as the Copyright (c) 2012 IETF Trust and the persons identified as the
document authors. All rights reserved. document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents Provisions Relating to IETF Documents
(http://trustee.ietf.org/license-info) in effect on the date of (http://trustee.ietf.org/license-info) in effect on the date of
publication of this document. Please review these documents publication of this document. Please review these documents
skipping to change at page 3, line 21 skipping to change at page 3, line 21
1.4. Notation . . . . . . . . . . . . . . . . . . . . . . . . 8 1.4. Notation . . . . . . . . . . . . . . . . . . . . . . . . 8
2. IRI Syntax . . . . . . . . . . . . . . . . . . . . . . . . . . 9 2. IRI Syntax . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.1. Summary of IRI Syntax . . . . . . . . . . . . . . . . . . 9 2.1. Summary of IRI Syntax . . . . . . . . . . . . . . . . . . 9
2.2. ABNF for IRI References and IRIs . . . . . . . . . . . . 10 2.2. ABNF for IRI References and IRIs . . . . . . . . . . . . 10
3. Processing IRIs and related protocol elements . . . . . . . . 13 3. Processing IRIs and related protocol elements . . . . . . . . 13
3.1. Converting to UCS . . . . . . . . . . . . . . . . . . . . 13 3.1. Converting to UCS . . . . . . . . . . . . . . . . . . . . 13
3.2. Parse the IRI into IRI components . . . . . . . . . . . . 13 3.2. Parse the IRI into IRI components . . . . . . . . . . . . 13
3.3. General percent-encoding of IRI components . . . . . . . 14 3.3. General percent-encoding of IRI components . . . . . . . 14
3.4. Mapping ireg-name . . . . . . . . . . . . . . . . . . . . 14 3.4. Mapping ireg-name . . . . . . . . . . . . . . . . . . . . 14
3.4.1. Mapping using Percent-Encoding . . . . . . . . . . . . 14 3.4.1. Mapping using Percent-Encoding . . . . . . . . . . . . 14
3.4.2. Mapping using Punycode . . . . . . . . . . . . . . . . 14 3.4.2. Mapping using Punycode . . . . . . . . . . . . . . . . 15
3.4.3. Additional Considerations . . . . . . . . . . . . . . 15 3.4.3. Additional Considerations . . . . . . . . . . . . . . 15
3.5. Mapping query components . . . . . . . . . . . . . . . . 16 3.5. Mapping query components . . . . . . . . . . . . . . . . 16
3.6. Mapping IRIs to URIs . . . . . . . . . . . . . . . . . . 16 3.6. Mapping IRIs to URIs . . . . . . . . . . . . . . . . . . 16
4. Converting URIs to IRIs . . . . . . . . . . . . . . . . . . . 16 4. Converting URIs to IRIs . . . . . . . . . . . . . . . . . . . 16
4.1. Examples . . . . . . . . . . . . . . . . . . . . . . . . 18 4.1. Examples . . . . . . . . . . . . . . . . . . . . . . . . 18
5. Use of IRIs . . . . . . . . . . . . . . . . . . . . . . . . . 19 5. Use of IRIs . . . . . . . . . . . . . . . . . . . . . . . . . 19
5.1. Limitations on UCS Characters Allowed in IRIs . . . . . . 19 5.1. Limitations on UCS Characters Allowed in IRIs . . . . . . 19
5.2. Software Interfaces and Protocols . . . . . . . . . . . . 20 5.2. Software Interfaces and Protocols . . . . . . . . . . . . 20
5.3. Format of URIs and IRIs in Documents and Protocols . . . 20 5.3. Format of URIs and IRIs in Documents and Protocols . . . 21
5.4. Use of UTF-8 for Encoding Original Characters . . . . . . 20 5.4. Use of UTF-8 for Encoding Original Characters . . . . . . 21
5.5. Relative IRI References . . . . . . . . . . . . . . . . . 22 5.5. Relative IRI References . . . . . . . . . . . . . . . . . 23
6. Legacy Extended IRIs (LEIRIs) . . . . . . . . . . . . . . . . 22 6. Legacy Extended IRIs (LEIRIs) . . . . . . . . . . . . . . . . 23
6.1. Legacy Extended IRI Syntax . . . . . . . . . . . . . . . 23 6.1. Legacy Extended IRI Syntax . . . . . . . . . . . . . . . 23
6.2. Conversion of Legacy Extended IRIs to IRIs . . . . . . . 23 6.2. Conversion of Legacy Extended IRIs to IRIs . . . . . . . 24
6.3. Characters Allowed in Legacy Extended IRIs but not in 6.3. Characters Allowed in Legacy Extended IRIs but not in
IRIs . . . . . . . . . . . . . . . . . . . . . . . . . . 23 IRIs . . . . . . . . . . . . . . . . . . . . . . . . . . 24
7. URI/IRI Processing Guidelines (Informative) . . . . . . . . . 25 7. URI/IRI Processing Guidelines (Informative) . . . . . . . . . 26
7.1. URI/IRI Software Interfaces . . . . . . . . . . . . . . . 25 7.1. URI/IRI Software Interfaces . . . . . . . . . . . . . . . 26
7.2. URI/IRI Entry . . . . . . . . . . . . . . . . . . . . . . 26 7.2. URI/IRI Entry . . . . . . . . . . . . . . . . . . . . . . 26
7.3. URI/IRI Transfer between Applications . . . . . . . . . . 26 7.3. URI/IRI Transfer between Applications . . . . . . . . . . 27
7.4. URI/IRI Generation . . . . . . . . . . . . . . . . . . . 27 7.4. URI/IRI Generation . . . . . . . . . . . . . . . . . . . 27
7.5. URI/IRI Selection . . . . . . . . . . . . . . . . . . . . 27 7.5. URI/IRI Selection . . . . . . . . . . . . . . . . . . . . 28
7.6. Display of URIs/IRIs . . . . . . . . . . . . . . . . . . 28 7.6. Display of URIs/IRIs . . . . . . . . . . . . . . . . . . 29
7.7. Interpretation of URIs and IRIs . . . . . . . . . . . . . 28 7.7. Interpretation of URIs and IRIs . . . . . . . . . . . . . 29
7.8. Upgrading Strategy . . . . . . . . . . . . . . . . . . . 29 7.8. Upgrading Strategy . . . . . . . . . . . . . . . . . . . 30
8. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 30 8. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 31
9. Security Considerations . . . . . . . . . . . . . . . . . . . 30 9. Security Considerations . . . . . . . . . . . . . . . . . . . 31
10. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 31 10. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 32
11. Main Changes Since RFC 3987 . . . . . . . . . . . . . . . . . 32 11. Main Changes Since RFC 3987 . . . . . . . . . . . . . . . . . 33
11.1. Split out Bidi, processing guidelines, comparison 11.1. Split out Bidi, processing guidelines, comparison
sections . . . . . . . . . . . . . . . . . . . . . . . . 32 sections . . . . . . . . . . . . . . . . . . . . . . . . 33
11.2. Major restructuring of IRI processing model . . . . . . . 32 11.2. Major restructuring of IRI processing model . . . . . . . 33
11.2.1. OLD WAY . . . . . . . . . . . . . . . . . . . . . . . 32 11.2.1. OLD WAY . . . . . . . . . . . . . . . . . . . . . . . 33
11.2.2. NEW WAY . . . . . . . . . . . . . . . . . . . . . . . 33 11.2.2. NEW WAY . . . . . . . . . . . . . . . . . . . . . . . 34
11.2.3. Extension of Syntax . . . . . . . . . . . . . . . . . 33 11.2.3. Extension of Syntax . . . . . . . . . . . . . . . . . 34
11.2.4. More to be added . . . . . . . . . . . . . . . . . . . 33 11.2.4. More to be added . . . . . . . . . . . . . . . . . . . 34
11.3. Change Log . . . . . . . . . . . . . . . . . . . . . . . 33 11.3. Change Log . . . . . . . . . . . . . . . . . . . . . . . 34
11.3.1. Changes after draft-ietf-iri-3987bis-01 . . . . . . . 33 11.3.1. Changes after draft-ietf-iri-3987bis-01 . . . . . . . 34
11.3.2. Changes from draft-duerst-iri-bis-07 to 11.3.2. Changes from draft-duerst-iri-bis-07 to
draft-ietf-iri-3987bis-00 . . . . . . . . . . . . . . 34 draft-ietf-iri-3987bis-00 . . . . . . . . . . . . . . 34
11.3.3. Changes from -06 to -07 of draft-duerst-iri-bis . . . 34 11.3.3. Changes from -06 to -07 of draft-duerst-iri-bis . . . 34
11.4. Changes from -00 to -01 . . . . . . . . . . . . . . . . . 34 11.4. Changes from -00 to -01 . . . . . . . . . . . . . . . . . 35
11.5. Changes from -05 to -06 of draft-duerst-iri-bis-00 . . . 34 11.5. Changes from -05 to -06 of draft-duerst-iri-bis-00 . . . 35
11.6. Changes from -04 to -05 of draft-duerst-iri-bis . . . . . 34 11.6. Changes from -04 to -05 of draft-duerst-iri-bis . . . . . 35
11.7. Changes from -03 to -04 of draft-duerst-iri-bis . . . . . 34 11.7. Changes from -03 to -04 of draft-duerst-iri-bis . . . . . 35
11.8. Changes from -02 to -03 of draft-duerst-iri-bis . . . . . 35 11.8. Changes from -02 to -03 of draft-duerst-iri-bis . . . . . 35
11.9. Changes from -01 to -02 of draft-duerst-iri-bis . . . . . 35 11.9. Changes from -01 to -02 of draft-duerst-iri-bis . . . . . 35
11.10. Changes from -00 to -01 of draft-duerst-iri-bis . . . . . 35 11.10. Changes from -00 to -01 of draft-duerst-iri-bis . . . . . 36
11.11. Changes from RFC 3987 to -00 of draft-duerst-iri-bis . . 35 11.11. Changes from RFC 3987 to -00 of draft-duerst-iri-bis . . 36
12. References . . . . . . . . . . . . . . . . . . . . . . . . . . 35 12. References . . . . . . . . . . . . . . . . . . . . . . . . . . 36
12.1. Normative References . . . . . . . . . . . . . . . . . . 35 12.1. Normative References . . . . . . . . . . . . . . . . . . 36
12.2. Informative References . . . . . . . . . . . . . . . . . 36 12.2. Informative References . . . . . . . . . . . . . . . . . 37
Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 39 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 40
1. Introduction 1. Introduction
1.1. Overview and Motivation 1.1. Overview and Motivation
A Uniform Resource Identifier (URI) is defined in [RFC3986] as a A Uniform Resource Identifier (URI) is defined in [RFC3986] as a
sequence of characters chosen from a limited subset of the repertoire sequence of characters chosen from a limited subset of the repertoire
of US-ASCII [ASCII] characters. of US-ASCII [ASCII] characters.
The characters in URIs are frequently used for representing words of The characters in URIs are frequently used for representing words of
skipping to change at page 11, line 27 skipping to change at page 11, line 27
ipath = ipath-abempty ; begins with "/" or is empty ipath = ipath-abempty ; begins with "/" or is empty
/ ipath-absolute ; begins with "/" but not "//" / ipath-absolute ; begins with "/" but not "//"
/ ipath-noscheme ; begins with a non-colon segment / ipath-noscheme ; begins with a non-colon segment
/ ipath-rootless ; begins with a segment / ipath-rootless ; begins with a segment
/ ipath-empty ; zero characters / ipath-empty ; zero characters
ipath-abempty = *( path-sep isegment ) ipath-abempty = *( path-sep isegment )
ipath-absolute = path-sep [ isegment-nz *( path-sep isegment ) ] ipath-absolute = path-sep [ isegment-nz *( path-sep isegment ) ]
ipath-noscheme = isegment-nz-nc *( path-sep isegment ) ipath-noscheme = isegment-nz-nc *( path-sep isegment )
ipath-rootless = isegment-nz *( path-sep isegment ) ipath-rootless = isegment-nz *( path-sep isegment )
ipath-empty = 0<ipchar> ipath-empty = ""
path-sep = "/" path-sep = "/"
isegment = *ipchar isegment = *ipchar
isegment-nz = 1*ipchar isegment-nz = 1*ipchar
isegment-nz-nc = 1*( iunreserved / pct-form / sub-delims isegment-nz-nc = 1*( iunreserved / pct-form / sub-delims
/ "@" ) / "@" )
; non-zero-length segment without any colon ":" ; non-zero-length segment without any colon ":"
ipchar = iunreserved / pct-form / sub-delims / ":" ipchar = iunreserved / pct-form / sub-delims / ":"
/ "@" / "@"
skipping to change at page 14, line 33 skipping to change at page 14, line 33
is the hexadecimal notation of the octet value. The hexadecimal is the hexadecimal notation of the octet value. The hexadecimal
notation SHOULD use uppercase letters. (This is the general URI notation SHOULD use uppercase letters. (This is the general URI
percent-encoding mechanism in Section 2.1 of [RFC3986].) percent-encoding mechanism in Section 2.1 of [RFC3986].)
Note that the mapping is an identity transformation for parsed URI Note that the mapping is an identity transformation for parsed URI
components of valid URIs, and is idempotent: applying the mapping a components of valid URIs, and is idempotent: applying the mapping a
second time will not change anything. second time will not change anything.
3.4. Mapping ireg-name 3.4. Mapping ireg-name
The mapping from <ireg-name> to a <reg-name> requires a choice
between one of the two methods described below.
3.4.1. Mapping using Percent-Encoding 3.4.1. Mapping using Percent-Encoding
The ireg-name component SHOULD be converted according to the general The ireg-name component SHOULD be converted according to the general
procedure for percent-encoding of IRI components described in procedure for percent-encoding of IRI components described in
Section 3.3. Section 3.3.
For example, the IRI For example, the IRI
"http://r&#xE9;sum&#xE9;.example.org" "http://r&#xE9;sum&#xE9;.example.org"
will be converted to will be converted to
"http://r%C3%A9sum%C3%A9.example.org". "http://r%C3%A9sum%C3%A9.example.org".
This conversion for ireg-name is in line with Section 3.2.2 of This conversion for ireg-name is in line with Section 3.2.2 of
[RFC3986], which does not mandate a particular registered name lookup [RFC3986], which does not mandate a particular registered name lookup
technology. For further background, see [RFC6055] and [Gettys]. technology. For further background, see [RFC6055] and [Gettys].
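As a rough illustration of the conversion in Section 3.4.1, the following Python sketch percent-encodes each non-ASCII character of a string as its UTF-8 octets, using uppercase hexadecimal digits. It is not taken from either draft: the function name is hypothetical, and the sketch deliberately ignores the other characters that the general procedure in Section 3.3 would also encode.

```python
def percent_encode_non_ascii(component: str) -> str:
    """Percent-encode every non-ASCII character as its UTF-8 octets.

    Hypothetical helper for illustration only; ASCII characters,
    including existing "%XX" sequences, are passed through unchanged.
    """
    out = []
    for ch in component:
        if ord(ch) < 0x80:
            # ASCII is left as is in this simplified sketch.
            out.append(ch)
        else:
            # One "%XX" triplet per UTF-8 octet, uppercase hex digits.
            out.extend(f"%{octet:02X}" for octet in ch.encode("utf-8"))
    return "".join(out)


# The example from the text, with U+00E9 written as a Python escape.
print(percent_encode_non_ascii("http://r\u00e9sum\u00e9.example.org"))
```

Because the output contains only ASCII, running the function on its own result leaves it unchanged, which matches the idempotence noted at the end of Section 3.3.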
3.4.2. Mapping using Punycode 3.4.2. Mapping using Punycode
The ireg-name component MAY also be converted as follows: In situations where it is certain that <ireg-name> is intended to be
used as a domain name to be processed by Domain Name Lookup (as per
[RFC5891]), an alternative method MAY be used, converting <ireg-name>
as follows:
If there are any sequences of <pct-encoded>, and their corresponding If there are any sequences of <pct-encoded>, and their corresponding
octets all represent valid UTF-8 octet sequences, then convert these octets all represent valid UTF-8 octet sequences, then convert these
back to Unicode character sequences. (If any <pct-encoded> sequences back to Unicode character sequences. (If any <pct-encoded> sequences
are not valid UTF-8 octet sequences, then leave the entire field as are not valid UTF-8 octet sequences, then leave the entire field as
is without any change, since punycode encoding would not succeed.) is without any change, since punycode encoding would not succeed.)
Replace the ireg-name part of the IRI by the part converted using the Replace the ireg-name part of the IRI by the part converted using the
Domain Name Lookup procedure (Subsections 5.3 to 5.5) of [RFC5891] Domain Name Lookup procedure (Subsections 5.3 to 5.5) of [RFC5891]
on each dot-separated label, and by using U+002E (FULL STOP) as a on each dot-separated label, and by using U+002E (FULL STOP) as a
skipping to change at page 20, line 5 skipping to change at page 20, line 21
cannot and should not check for such limitations.) cannot and should not check for such limitations.)
b. The UCS contains many areas of characters for which there are b. The UCS contains many areas of characters for which there are
strong visual look-alikes. Because of the likelihood of strong visual look-alikes. Because of the likelihood of
transcription errors, these also should be avoided. This includes transcription errors, these also should be avoided. This includes
the full-width equivalents of Latin characters, half-width the full-width equivalents of Latin characters, half-width
Katakana characters for Japanese, and many others. It also Katakana characters for Japanese, and many others. It also
includes many look-alikes of "space", "delims", and "unwise", includes many look-alikes of "space", "delims", and "unwise",
characters excluded in [RFC3491]. characters excluded in [RFC3491].
c. At the start of a component, the use of combining marks is
strongly discouraged. As an example, a COMBINING TILDE OVERLAY
(U+0334) would be very confusing at the start of an <isegment>.
Combined with the preceding '/', it might look like a solidus
with combining tilde overlay, but IRI processing software will
parse and process the '/' separately.
d. The ZERO WIDTH NON-JOINER (U+200C) and ZERO WIDTH JOINER (U+200D)
are invisible in most contexts, but are crucial in some very
limited contexts. Appendix A of [RFC5892] contains contextual
restrictions for these and some other characters. The use of
these characters is strongly discouraged except in the relevant
contexts.
Additional information is available from [UNIXML]. [UNIXML] is Additional information is available from [UNIXML]. [UNIXML] is
written in the context of running text rather than in that of written in the context of running text rather than in that of
identifiers. Nevertheless, it discusses many of the categories of identifiers. Nevertheless, it discusses many of the categories of
characters not appropriate for IRIs. characters not appropriate for IRIs.
5.2. Software Interfaces and Protocols 5.2. Software Interfaces and Protocols
Although an IRI is defined as a sequence of characters, software Although an IRI is defined as a sequence of characters, software
interfaces for URIs typically function on sequences of octets or interfaces for URIs typically function on sequences of octets or
other kinds of code units. Thus, software interfaces and protocols other kinds of code units. Thus, software interfaces and protocols
skipping to change at page 22, line 37 skipping to change at page 23, line 19
5.5. Relative IRI References 5.5. Relative IRI References
Processing of relative IRI references against a base is handled Processing of relative IRI references against a base is handled
straightforwardly; the algorithms of [RFC3986] can be applied straightforwardly; the algorithms of [RFC3986] can be applied
directly, treating the characters additionally allowed in IRI directly, treating the characters additionally allowed in IRI
references in the same way that unreserved characters are in URI references in the same way that unreserved characters are in URI
references. references.
6. Legacy Extended IRIs (LEIRIs) 6. Legacy Extended IRIs (LEIRIs)
For historic reasons, some formats have allowed variants of IRIs that In some cases, there have been formats which have used a protocol
are somewhat less restricted in syntax. This section provides a element which is a variant of the IRI definition; these variants have
definition and a name (Legacy Extended IRI or LEIRI) for these usually been somewhat less restricted in syntax. This section
variants for easier reference. These variants have to be used with provides a definition and a name (Legacy Extended IRI or LEIRI) for
care; they require further processing before being fully one of these variants used widely in XML-based protocols. This
interchangeable as IRIs. New protocols and formats SHOULD NOT use variant has to be used with care; it requires further processing
Legacy Extended IRIs. Even where Legacy Extended IRIs are allowed, before being fully interchangeable as IRIs. New protocols and
only IRIs fully conforming to the syntax definition in Section 2.2 formats SHOULD NOT use Legacy Extended IRIs. Even where Legacy
SHOULD be created, generated, and used. The provisions in this Extended IRIs are allowed, only IRIs fully conforming to the syntax
section also apply to Legacy Extended IRI references. definition in Section 2.2 SHOULD be created, generated, and used.
The provisions in this section also apply to Legacy Extended IRI
references.
6.1. Legacy Extended IRI Syntax 6.1. Legacy Extended IRI Syntax
The syntax of Legacy Extended IRIs is the same as that for IRIs, This section defines Legacy Extended IRIs (LEIRIs). The syntax of
except that ucschar is redefined as follows: Legacy Extended IRIs is the same as that for <IRI-reference>, except
that the ucschar production is replaced by the leiri-ucschar
production:
ucschar = " " / "<" / ">" / '"' / "{" / "}" / "|" leiri-ucschar = " " / "<" / ">" / '"' / "{" / "}" / "|"
/ "\" / "^" / "`" / %x0-1F / %x7F-D7FF / "\" / "^" / "`" / %x0-1F / %x7F-D7FF
/ %xE000-FFFD / %x10000-10FFFF / %xE000-FFFD / %x10000-10FFFF
The restriction on bidirectional formatting characters in [Bidi] is The restriction on bidirectional formatting characters in [Bidi] is
lifted. The iprivate production becomes redundant. lifted. The iprivate production becomes redundant.
Likewise, the syntax for Legacy Extended IRI references (LEIRI Likewise, the syntax for Legacy Extended IRI references (LEIRI
references) is the same as that for IRI references with the above references) is the same as that for IRI references with the above
redefinition of ucschar applied. replacement of ucschar with leiri-ucschar.
Formats that use Legacy Extended IRIs or Legacy Extended IRI
references MAY further restrict the characters allowed therein,
either implicitly by the fact that the format as such does not allow
some characters, or explicitly. An example of a character not
allowed implicitly may be the NUL character (U+0000). However, all
the characters allowed in IRIs MUST still be allowed.
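The leiri-ucschar production above can be read as a simple code point test. The Python sketch below transcribes the ranges exactly as they appear in the ABNF; the function name is an assumption made for illustration, and the check covers only this one production, not the full LEIRI grammar.

```python
# The ASCII characters listed literally in the leiri-ucschar production.
LEIRI_ASCII_EXTRAS = frozenset(' <>"{}|\\^`')


def is_leiri_ucschar(ch: str) -> bool:
    """Return True if a single character matches leiri-ucschar.

    Hypothetical helper; the ranges mirror the ABNF in Section 6.1.
    """
    cp = ord(ch)
    return (
        ch in LEIRI_ASCII_EXTRAS
        or cp <= 0x1F                  # %x0-1F
        or 0x7F <= cp <= 0xD7FF        # %x7F-D7FF
        or 0xE000 <= cp <= 0xFFFD      # %xE000-FFFD
        or 0x10000 <= cp <= 0x10FFFF   # %x10000-10FFFF
    )
```

Under this reading, a space or U+00E9 matches, while an ordinary ASCII letter does not (it is covered by other productions), and U+FFFE and U+FFFF fall outside every range.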
6.2. Conversion of Legacy Extended IRIs to IRIs 6.2. Conversion of Legacy Extended IRIs to IRIs
To convert a Legacy Extended IRI (reference) to an IRI (reference), To convert a Legacy Extended IRI (reference) to an IRI (reference),
each character allowed in a Legacy Extended IRI (reference) but not each character allowed in a Legacy Extended IRI (reference) but not
allowed in an IRI (reference) (see Section 6.3) MUST be percent- allowed in an IRI (reference) (see Section 6.3) MUST be percent-
encoded by applying steps 2.1 to 2.3 of Section 3.6. encoded by applying the steps in Section 3.3.
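A partial sketch of this conversion in Python follows. It handles only three of the character groups listed in Section 6.3 (space, the delimiters and unwise characters, and the C0, DEL, and C1 controls); the remaining groups, and the query-only handling of private use code points, are left out. The function name and the choice of groups are assumptions made for illustration, not part of either draft.

```python
# Space, delimiters, and unwise characters from Section 6.3.
LEIRI_ONLY_ASCII = frozenset(' <>"{}|\\^`')


def leiri_to_iri_partial(leiri: str) -> str:
    """Percent-encode a subset of the LEIRI-only characters.

    Hypothetical helper: covers space, delimiters, unwise characters,
    and controls; every other character is passed through unchanged.
    """
    out = []
    for ch in leiri:
        cp = ord(ch)
        is_control = cp <= 0x1F or 0x7F <= cp <= 0x9F
        if ch in LEIRI_ONLY_ASCII or is_control:
            # Encode as UTF-8 octets, one "%XX" triplet per octet.
            out.extend(f"%{octet:02X}" for octet in ch.encode("utf-8"))
        else:
            out.append(ch)
    return "".join(out)


print(leiri_to_iri_partial("http://example.org/a b|c"))
```

Characters that are already valid in IRIs, such as U+00E9, are not touched, so the result stays an IRI rather than being reduced all the way to a URI.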
6.3. Characters Allowed in Legacy Extended IRIs but not in IRIs 6.3. Characters Allowed in Legacy Extended IRIs but not in IRIs
This section provides a list of the groups of characters and code This section provides a list of the groups of characters and code
points that are allowed in Legacy Extended IRIs, but are not allowed points that are allowed in Legacy Extended IRIs, but are not allowed
in IRIs or are allowed in IRIs only in the query part. For each in IRIs or are allowed in IRIs only in the query part. For each
group of characters, advice on the usage of these characters is also group of characters, advice on the usage of these characters is also
given, concentrating on the reasons not to use them. given, concentrating on the reasons not to use them.
Space (U+0020): Some formats and applications use space as a Space (U+0020): Some formats and applications use space as a
delimiter, e.g. for items in a list. Appendix C of [RFC3986] also delimiter, e.g., for items in a list. Appendix C of [RFC3986]
mentions that white space may have to be added when displaying or also mentions that white space may have to be added when
printing long URIs; the same applies to long IRIs. This means displaying or printing long URIs; the same applies to long IRIs.
that spaces can disappear, or can make the Legacy Extended IRI to Spaces might disappear, or a single Legacy Extended IRI might
be interpreted as two or more separate IRIs. incorrectly be interpreted as two or more separate ones.
Delimiters "<" (U+003C), ">" (U+003E), and '"' (U+0022): Appendix Delimiters "<" (U+003C), ">" (U+003E), and '"' (U+0022): Appendix
C of [RFC3986] suggests the use of double-quotes C of [RFC3986] suggests the use of double-quotes
("http://example.com/") and angle brackets (<http://example.com/>) ("http://example.com/") and angle brackets (<http://example.com/>)
as delimiters for URIs in plain text. These conventions are often as delimiters for URIs in plain text. These conventions are often
used, and also apply to IRIs. Legacy Extended IRIs using these used, and also apply to IRIs. Legacy Extended IRIs using these
characters will be cut off at the wrong place. characters might be cut off at the wrong place.
Unwise characters "\" (U+005C), "^" (U+005E), "`" (U+0060), "{" Unwise characters "\" (U+005C), "^" (U+005E), "`" (U+0060), "{"
(U+007B), "|" (U+007C), and "}" (U+007D): These characters (U+007B), "|" (U+007C), and "}" (U+007D): These characters
originally have been excluded from URIs because the respective originally were excluded from URIs because the respective
codepoints are assigned to different graphic characters in some codepoints are assigned to different graphic characters in some
7-bit or 8-bit encoding. Despite the move to Unicode, some of 7-bit or 8-bit encoding. Despite the move to Unicode, some of
these characters are still occasionally displayed differently on these characters are still occasionally displayed differently on
some systems, e.g. U+005C as a Japanese Yen symbol. Also, the some systems, e.g., U+005C as a Japanese Yen symbol. Also, the
fact that these characters are not used in URIs or IRIs has fact that these characters are not used in URIs or IRIs has
encouraged their use outside URIs or IRIs in contexts that may encouraged their use outside URIs or IRIs in contexts that may
include URIs or IRIs. In case a Legacy Extended IRI with such a include URIs or IRIs. In case a Legacy Extended IRI with such a
character is used in such a context, the Legacy Extended IRI will character is used in such a context, the Legacy Extended IRI will
be interpreted piecemeal. be interpreted piecemeal.
The controls (C0 controls, DEL, and C1 controls, #x0 - #x1F #x7F - The controls (C0 controls, DEL, and C1 controls, #x0 - #x1F #x7F -
#x9F): There is no way to transmit these characters reliably #x9F): There is no way to transmit these characters reliably
except potentially in electronic form. Even when in electronic except potentially in electronic form. Even when in electronic
form, some software components might silently filter out some of form, some software components might silently filter out some of
skipping to change at page 25, line 8 skipping to change at page 25, line 32
Private use code points (U+E000-F8FF, U+F0000-FFFFD, U+100000- Private use code points (U+E000-F8FF, U+F0000-FFFFD, U+100000-
10FFFD): Display and interpretation of these code points is by 10FFFD): Display and interpretation of these code points is by
definition undefined without private agreement. Therefore, these definition undefined without private agreement. Therefore, these
code points are not suited for use on the Internet. They are not code points are not suited for use on the Internet. They are not
interoperable and may have unpredictable effects. interoperable and may have unpredictable effects.
Tags (U+E0000-E0FFF): These characters provide a way to language Tags (U+E0000-E0FFF): These characters provide a way to language
tag in Unicode plain text. They are not appropriate for Legacy tag in Unicode plain text. They are not appropriate for Legacy
Extended IRIs because language information in identifiers cannot Extended IRIs because language information in identifiers cannot
reliably be input, transmitted (e.g. on a visual medium such as reliably be input, transmitted (e.g., on a visual medium such as
paper), or recognized. paper), or recognized.
Non-characters (U+FDD0-FDEF, U+1FFFE-1FFFF, U+2FFFE-2FFFF, Non-characters (U+FDD0-FDEF, U+1FFFE-1FFFF, U+2FFFE-2FFFF,
U+3FFFE-3FFFF, U+4FFFE-4FFFF, U+5FFFE-5FFFF, U+6FFFE-6FFFF, U+3FFFE-3FFFF, U+4FFFE-4FFFF, U+5FFFE-5FFFF, U+6FFFE-6FFFF,
U+7FFFE-7FFFF, U+8FFFE-8FFFF, U+9FFFE-9FFFF, U+AFFFE-AFFFF, U+7FFFE-7FFFF, U+8FFFE-8FFFF, U+9FFFE-9FFFF, U+AFFFE-AFFFF,
U+BFFFE-BFFFF, U+CFFFE-CFFFF, U+DFFFE-DFFFF, U+EFFFE-EFFFF, U+BFFFE-BFFFF, U+CFFFE-CFFFF, U+DFFFE-DFFFF, U+EFFFE-EFFFF,
U+FFFFE-FFFFF, U+10FFFE-10FFFF): These code points are defined as U+FFFFE-FFFFF, U+10FFFE-10FFFF): These code points are defined as
non-characters. Applications may use some of them internally, but non-characters. Applications may use some of them internally, but
are not prepared to interchange them. are not prepared to interchange them.
For reference, we here also list the code points and code units not For reference, we here also list the code points and code units not
even allowed in Legacy Extended IRIs: even allowed in Legacy Extended IRIs:
Surrogate code units (D800-DFFF): These do not represent Unicode Surrogate code units (D800-DFFF): These do not represent Unicode
codepoints. codepoints.
Non-characters (U+FFFE-FFFF): These are allowed in neither XML nor
LEIRIs.
7. URI/IRI Processing Guidelines (Informative) 7. URI/IRI Processing Guidelines (Informative)
This informative section provides guidelines for supporting IRIs in This informative section provides guidelines for supporting IRIs in
the same software components and operations that currently process the same software components and operations that currently process
URIs: Software interfaces that handle URIs, software that allows URIs: Software interfaces that handle URIs, software that allows
users to enter URIs, software that creates or generates URIs, users to enter URIs, software that creates or generates URIs,
software that displays URIs, formats and protocols that transport software that displays URIs, formats and protocols that transport
URIs, and software that interprets URIs. These may all require URIs, and software that interprets URIs. These may all require
modification before functioning properly with IRIs. The modification before functioning properly with IRIs. The
considerations in this section also apply to URI references and IRI considerations in this section also apply to URI references and IRI
skipping to change at page 30, line 33 skipping to change at page 31, line 11
the character encoding) and will therefore be compatible with IRIs. the character encoding) and will therefore be compatible with IRIs.
These recommendations, when taken together, will allow for the These recommendations, when taken together, will allow for the
extension from URIs to IRIs in order to handle characters other than extension from URIs to IRIs in order to handle characters other than
US-ASCII while minimizing interoperability problems. For US-ASCII while minimizing interoperability problems. For
considerations regarding the upgrade of URI scheme definitions, see considerations regarding the upgrade of URI scheme definitions, see
Section 5.4. Section 5.4.
8. IANA Considerations 8. IANA Considerations
NOTE: THIS SECTION NEEDS REVIEW AGAINST HAPPIANA WORK.
RFC Editor and IANA note: Please Replace RFC XXXX with the number of RFC Editor and IANA note: Please Replace RFC XXXX with the number of
this document when it issues as an RFC. this document when it issues as an RFC, and RFC YYYY with the number
of the RFC issued for draft-ietf-iri-rfc3987bis.
IANA maintains a registry of "URI schemes". A "URI scheme" also IANA maintains a registry of "URI schemes". This document attempts
serves as an "IRI scheme". to make it clear from the registry that a "URI scheme" also serves as an
"IRI scheme", and makes several changes to the registry.
To clarify that the URI scheme registration process also applies to The description of the registry should be changed: "RFC 4395 defined
IRIs, change the description of the "URI schemes" registry header to an IANA-maintained registry of URI Schemes. RFC XXXX updates this
say "[RFC4395] defines an IANA-maintained registry of URI Schemes. registry to make it clear that the registered values also serve as
These registries include the Permanent and Provisional URI Schemes. IRI schemes, as defined in RFC YYYY."
RFC XXXX updates this registry to designate that schemes may also
indicate their usability as IRI schemes.
Update "per RFC 4395" to "per RFC 4395 and RFC XXXX". The registry includes schemes marked as Permanent or Provisional.
Previously, this was accomplished by having two sections, "Permanent"
and "Provisional". However, in order to allow other statuses
("Historical", and possibly a Proposed status for proposals which
have been received but not accepted), the registry should be changed
so that the status is indicated in a separate "Status" column, whose
values may be "Permanent", "Provisional" or "Historical". Changes in
status as well as updates to the entire registration may be
accomplished by requests and expert review.
9. Security Considerations 9. Security Considerations
The security considerations discussed in [RFC3986] also apply to The security considerations discussed in [RFC3986] also apply to
IRIs. In addition, the following issues require particular care for IRIs. In addition, the following issues require particular care for
IRIs. IRIs.
Incorrect encoding or decoding can lead to security problems. For Incorrect encoding or decoding can lead to security problems. For
example, some UTF-8 decoders do not check against overlong byte example, some UTF-8 decoders do not check against overlong byte
sequences. See [UTR36] Section 3 for details. sequences. See [UTR36] Section 3 for details.
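As an illustration of the overlong-sequence problem mentioned above, the following sketch relies on a strict UTF-8 decoder that refuses such input. It is an assumption-laden example rather than text from the draft: the function name decode_utf8_strict is hypothetical, and the byte strings are sample inputs chosen for illustration. The bytes C0 AF are an overlong two-byte encoding of "/" (U+002F); a decoder that accepted them could let a "/" slip past checks performed on the raw bytes.

```python
from typing import Optional


def decode_utf8_strict(data: bytes) -> Optional[str]:
    """Decode bytes as UTF-8, returning None for ill-formed input.

    Python's built-in UTF-8 codec rejects overlong sequences in strict
    mode, so no extra length check is needed here.
    """
    try:
        return data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return None


# Sample inputs (illustrative only).
OVERLONG_SLASH_2 = b"\xc0\xaf"      # overlong 2-byte form of U+002F
OVERLONG_SLASH_3 = b"\xe0\x80\xaf"  # overlong 3-byte form of U+002F
```

Decoders written by hand, or in environments without a strict mode, would need an explicit check that each decoded code point could not have been encoded in fewer bytes.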
skipping to change at page 36, line 26 skipping to change at page 37, line 14
Resource Identifier (URI): Generic Syntax", STD 66, Resource Identifier (URI): Generic Syntax", STD 66,
RFC 3986, January 2005. RFC 3986, January 2005.
[RFC5890] Klensin, J., "Internationalized Domain Names for [RFC5890] Klensin, J., "Internationalized Domain Names for
Applications (IDNA): Definitions and Document Framework", Applications (IDNA): Definitions and Document Framework",
RFC 5890, August 2010. RFC 5890, August 2010.
[RFC5891] Klensin, J., "Internationalized Domain Names in [RFC5891] Klensin, J., "Internationalized Domain Names in
Applications (IDNA): Protocol", RFC 5891, August 2010. Applications (IDNA): Protocol", RFC 5891, August 2010.
[RFC5892] Faltstrom, P., "The Unicode Code Points and
Internationalized Domain Names for Applications (IDNA)",
RFC 5892, August 2010.
[STD68] Crocker, D. and P. Overell, "Augmented BNF for Syntax [STD68] Crocker, D. and P. Overell, "Augmented BNF for Syntax
Specifications: ABNF", STD 68, RFC 5234, January 2008. Specifications: ABNF", STD 68, RFC 5234, January 2008.
[UNIV6] The Unicode Consortium, "The Unicode Standard, Version [UNIV6] The Unicode Consortium, "The Unicode Standard, Version
6.0.0 (Mountain View, CA, The Unicode Consortium, 2011, 6.0.0 (Mountain View, CA, The Unicode Consortium, 2011,
ISBN 978-1-936213-01-6)", October 2010. ISBN 978-1-936213-01-6)", October 2010.
[UTR15] Davis, M. and M. Duerst, "Unicode Normalization Forms", [UTR15] Davis, M. and M. Duerst, "Unicode Normalization Forms",
Unicode Standard Annex #15, March 2008, Unicode Standard Annex #15, March 2008,
<http://www.unicode.org/unicode/reports/tr15/ <http://www.unicode.org/unicode/reports/tr15/
skipping to change at page 39, line 22 skipping to change at page 40, line 15
[XPointer] [XPointer]
Grosso, P., Maler, E., Marsh, J., and N. Walsh, "XPointer Grosso, P., Maler, E., Marsh, J., and N. Walsh, "XPointer
Framework", World Wide Web Consortium REC-xptr-framework- Framework", World Wide Web Consortium REC-xptr-framework-
20030325, March 2003, 20030325, March 2003,
<http://www.w3.org/TR/xptr-framework/#escaping>. <http://www.w3.org/TR/xptr-framework/#escaping>.
Authors' Addresses Authors' Addresses
Martin Duerst (Note: Please write "Duerst" with u-umlaut wherever Martin Duerst (Note: Please write "Duerst" with u-umlaut wherever
possible, for example as "D&#252;rst" in XML and HTML.) possible, for example as "D&#252;rst" in XML and HTML.)
Aoyama Gakuin University Aoyama Gakuin University
5-10-1 Fuchinobe 5-10-1 Fuchinobe
Sagamihara, Kanagawa 229-8558 Sagamihara, Kanagawa 229-8558
Japan Japan
Phone: +81 42 759 6329 Phone: +81 42 759 6329
Fax: +81 42 759 6495 Fax: +81 42 759 6495
Email: duerst@it.aoyama.ac.jp Email: duerst@it.aoyama.ac.jp
URI: http://www.sw.it.aoyama.ac.jp/D%C3%BCrst/ URI: http://www.sw.it.aoyama.ac.jp/D%C3%BCrst/
(Note: This is the percent-encoded form of an IRI.) (Note: This is the percent-encoded form of an IRI.)
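The relationship between the IRI form and the percent-encoded form noted above can be sketched with Python's standard urllib.parse module. This is an illustrative example only: it converts a single path segment, assumes UTF-8 as the encoding, and does not implement the full IRI-to-URI mapping (for instance, it does not handle the host component).

```python
from urllib.parse import quote, unquote

# Path segment as it would appear in the IRI (u-umlaut written as an escape).
iri_segment = "D\u00fcrst"

# Percent-encode the UTF-8 bytes of each non-ASCII character.
uri_segment = quote(iri_segment, safe="")

# Reverse direction: interpret the percent-encoded bytes as UTF-8.
round_trip = unquote(uri_segment, encoding="utf-8", errors="strict")
```

Here uri_segment is "D%C3%BCrst", matching the final path segment of the URI above, and round_trip equals the original segment.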
 End of changes. 35 change blocks. 
84 lines changed or deleted 118 lines changed or added
