--- 1/draft-ietf-iri-3987bis-02.txt 2010-10-25 11:15:47.000000000 +0200
+++ 2/draft-ietf-iri-3987bis-03.txt 2010-10-25 11:15:47.000000000 +0200
@@ -1,21 +1,21 @@
Internationalized Resource M. Duerst
Identifiers (iri) Aoyama Gakuin University
Internet-Draft M. Suignard
Obsoletes: 3987 (if approved) Unicode Consortium
Intended status: Standards Track L. Masinter
-Expires: April 20, 2011 Adobe
- October 17, 2010
+Expires: April 28, 2011 Adobe
+ October 25, 2010
Internationalized Resource Identifiers (IRIs)
- draft-ietf-iri-3987bis-02
+ draft-ietf-iri-3987bis-03
Abstract
This document defines the Internationalized Resource Identifier (IRI)
protocol element, as an extension of the Uniform Resource Identifier
(URI). An IRI is a sequence of characters from the Universal
Character Set (Unicode/ISO 10646). Grammar and processing rules are
given for IRIs and related syntactic forms.
In addition, this document provides named additional rule sets for
@@ -55,21 +55,21 @@
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."
The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt.
The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html.
- This Internet-Draft will expire on April 20, 2011.
+ This Internet-Draft will expire on April 28, 2011.
Copyright Notice
Copyright (c) 2010 IETF Trust and the persons identified as the
document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents
(http://trustee.ietf.org/license-info) in effect on the date of
publication of this document. Please review these documents
@@ -98,47 +98,47 @@
1.2. Applicability . . . . . . . . . . . . . . . . . . . . . . 6
1.3. Definitions . . . . . . . . . . . . . . . . . . . . . . . 6
1.4. Notation . . . . . . . . . . . . . . . . . . . . . . . . 9
2. IRI Syntax . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.1. Summary of IRI Syntax . . . . . . . . . . . . . . . . . . 10
2.2. ABNF for IRI References and IRIs . . . . . . . . . . . . 10
3. Processing IRIs and related protocol elements . . . . . . . . 13
3.1. Converting to UCS . . . . . . . . . . . . . . . . . . . . 14
3.2. Parse the IRI into IRI components . . . . . . . . . . . . 14
3.3. General percent-encoding of IRI components . . . . . . . 15
- 3.4. Mapping ireg-name . . . . . . . . . . . . . . . . . . . . 15
+ 3.4. Mapping ireg-name . . . . . . . . . . . . . . . . . . . . 16
3.5. Mapping query components . . . . . . . . . . . . . . . . 17
3.6. Mapping IRIs to URIs . . . . . . . . . . . . . . . . . . 17
3.7. Converting URIs to IRIs . . . . . . . . . . . . . . . . . 17
3.7.1. Examples . . . . . . . . . . . . . . . . . . . . . . . 19
4. Bidirectional IRIs for Right-to-Left Languages . . . . . . . . 20
4.1. Logical Storage and Visual Presentation . . . . . . . . . 21
4.2. Bidi IRI Structure . . . . . . . . . . . . . . . . . . . 22
4.3. Input of Bidi IRIs . . . . . . . . . . . . . . . . . . . 23
4.4. Examples . . . . . . . . . . . . . . . . . . . . . . . . 23
5. Normalization and Comparison . . . . . . . . . . . . . . . . . 25
- 5.1. Equivalence . . . . . . . . . . . . . . . . . . . . . . . 25
+ 5.1. Equivalence . . . . . . . . . . . . . . . . . . . . . . . 26
5.2. Preparation for Comparison . . . . . . . . . . . . . . . 26
5.3. Comparison Ladder . . . . . . . . . . . . . . . . . . . . 27
5.3.1. Simple String Comparison . . . . . . . . . . . . . . . 27
5.3.2. Syntax-Based Normalization . . . . . . . . . . . . . . 28
5.3.3. Scheme-Based Normalization . . . . . . . . . . . . . . 31
5.3.4. Protocol-Based Normalization . . . . . . . . . . . . . 32
- 6. Use of IRIs . . . . . . . . . . . . . . . . . . . . . . . . . 32
+ 6. Use of IRIs . . . . . . . . . . . . . . . . . . . . . . . . . 33
6.1. Limitations on UCS Characters Allowed in IRIs . . . . . . 33
6.2. Software Interfaces and Protocols . . . . . . . . . . . . 33
- 6.3. Format of URIs and IRIs in Documents and Protocols . . . 33
+ 6.3. Format of URIs and IRIs in Documents and Protocols . . . 34
6.4. Use of UTF-8 for Encoding Original Characters . . . . . . 34
6.5. Relative IRI References . . . . . . . . . . . . . . . . . 36
7. Liberal handling of otherwise invalid IRIs . . . . . . . . . . 36
7.1. LEIRI processing . . . . . . . . . . . . . . . . . . . . 36
- 7.2. Web Address processing . . . . . . . . . . . . . . . . . 36
+ 7.2. Web Address processing . . . . . . . . . . . . . . . . . 37
7.3. Characters not allowed in IRIs . . . . . . . . . . . . . 38
8. URI/IRI Processing Guidelines (Informative) . . . . . . . . . 40
8.1. URI/IRI Software Interfaces . . . . . . . . . . . . . . . 40
8.2. URI/IRI Entry . . . . . . . . . . . . . . . . . . . . . . 41
8.3. URI/IRI Transfer between Applications . . . . . . . . . . 42
8.4. URI/IRI Generation . . . . . . . . . . . . . . . . . . . 42
8.5. URI/IRI Selection . . . . . . . . . . . . . . . . . . . . 43
8.6. Display of URIs/IRIs . . . . . . . . . . . . . . . . . . 43
8.7. Interpretation of URIs and IRIs . . . . . . . . . . . . . 44
8.8. Upgrading Strategy . . . . . . . . . . . . . . . . . . . 44
@@ -215,21 +215,21 @@
discusses various forms of equivalence between IRIs. Section 6
discusses the use of IRIs in different situations. Section 8 gives
additional informative guidelines. Section 10 discusses IRI-specific
security considerations.
1.2. Applicability
IRIs are designed to allow protocols and software that deal with URIs
to be updated to handle IRIs. A "URI scheme" (as defined by
[RFC3986] and registered through the IANA process defined in
- [RFC4395] also serves as an "IRI scheme". Processing of IRIs is
+ [RFC4395bis]) also serves as an "IRI scheme". Processing of IRIs is
accomplished by extending the URI syntax while retaining (and not
expanding) the set of "reserved" characters, such that the syntax for
any URI scheme may be uniformly extended to allow non-ASCII
characters. In addition, following parsing of an IRI, it is possible
to construct a corresponding URI by first encoding characters outside
of the allowed URI range and then reassembling the components.
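The encoding step described above can be sketched as follows. This is a minimal illustration, not the procedure defined in Section 3: the function name percent_encode_non_ascii is hypothetical, the sketch treats the IRI as one string instead of parsing it into components, and it ignores the separate ireg-name mapping of Section 3.4.

```python
def percent_encode_non_ascii(text: str) -> str:
    """Percent-encode each non-ASCII character as its UTF-8 octets.

    Hypothetical helper for illustration only; ASCII characters,
    including reserved ones, are passed through unchanged.
    """
    out = []
    for ch in text:
        if ord(ch) < 0x80:
            out.append(ch)
        else:
            # One "%XX" triplet per UTF-8 octet of the character.
            out.extend(f"%{octet:02X}" for octet in ch.encode("utf-8"))
    return "".join(out)


uri = percent_encode_non_ascii("http://www.example.org/r\u00e9sum\u00e9.html")
print(uri)  # http://www.example.org/r%C3%A9sum%C3%A9.html
```

Because the set of reserved characters is not expanded, leaving every ASCII character untouched keeps the component delimiters where they were.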
Practical use of IRI forms in place of URI forms depends on the
following conditions being met:
@@ -245,21 +245,21 @@
b. The protocol or format carrying the IRIs MUST have a mechanism to
represent the wide range of characters used in IRIs, either
natively or by some protocol- or format-specific escaping
mechanism (for example, numeric character references in [XML1]).
c. The URI scheme definition, if it explicitly allows a percent sign
("%") in any syntactic component, SHOULD define the interpretation
of sequences of percent-encoded octets (using "%XX" hex octets) as
octets from sequences of UTF-8 encoded strings; this is recommended
- in the guidelines for registering new schemes, [RFC4395]. For
+ in the guidelines for registering new schemes, [RFC4395bis]. For
example, this is the practice for IMAP URLs [RFC2192], POP URLs
[RFC2384] and the URN syntax [RFC2141]. Note that use of
percent-encoding may also be restricted in some situations; for
example, URI schemes that disallow percent-encoding might still be
used with a fragment identifier which is percent-encoded (e.g.,
[XPointer]). See Section 6.4 for further discussion.
1.3. Definitions
The following definitions are used in this document; they follow the
@@ -575,27 +575,33 @@
and IRI references (i.e., absolute or relative forms); for IRIs, some
steps are scheme specific.
3.1. Converting to UCS
Input that is already in a Unicode form (i.e., a sequence of Unicode
characters or an octet-stream representing a Unicode-based character
encoding such as UTF-8 or UTF-16) should be left as is and not
normalized (see Section 5.3.2.2).
- If the IRI or IRI reference is an octet stream in some known non-
- Unicode character encoding, convert the IRI to a sequence of
- characters from the UCS; this sequence SHOULD also be normalized
- according to Unicode Normalization Form C (NFC, [UTR15]). In this
- case, retain the original character encoding as the "document
- character encoding". (DESIGN QUESTION: NOT WHAT MOST IMPLEMENTATIONS
- DO, CHANGE? )
+ An IRI or IRI reference is a sequence of characters from the UCS.
+ For IRIs that are not already in a Unicode form (as when written on
+ paper, read aloud, or represented in a text stream using a legacy
+ character encoding), convert the IRI to Unicode. Note that some
+ character encodings or transcriptions can be converted to or
+ represented by more than one sequence of Unicode characters. Ideally
+ the resulting IRI would use a normalized form, such as Unicode
+ Normalization Form C [UTR15] (see Section 5.3 Normalization and
+ Comparison), since that ensures a stable, consistent representation
+ that is most likely to produce the intended results. Implementers
+ and users are cautioned that, while denormalized character sequences
+ are valid, they might be difficult for other users or processes to
+ reproduce and might lead to unexpected results.
In other cases (written on paper, read aloud, or otherwise
represented independent of any character encoding) represent the IRI
as a sequence of characters from the UCS normalized according to
Unicode Normalization Form C (NFC, [UTR15]).
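For input in a legacy character encoding, the conversion might look like the sketch below. It assumes Python's standard codecs and unicodedata modules; the function name legacy_to_unicode and the choice of ISO-8859-1 as the legacy encoding are illustrative assumptions. Input that is already in a Unicode form is meant to be left as is and would not go through this function.

```python
import unicodedata


def legacy_to_unicode(octets: bytes, encoding: str) -> str:
    """Decode a legacy-encoded IRI and normalize the result to NFC.

    Hypothetical helper; 'encoding' is whatever legacy character
    encoding the surrounding document or protocol declares.
    """
    text = octets.decode(encoding)
    return unicodedata.normalize("NFC", text)


iri = legacy_to_unicode(b"http://www.example.org/r\xe9sum\xe9.html", "iso-8859-1")
print(iri == "http://www.example.org/r\u00e9sum\u00e9.html")  # True

# A decomposed sequence (e + combining acute accent) and the
# precomposed character compare equal only after normalization.
decomposed = "re\u0301sume\u0301"
print(decomposed == "r\u00e9sum\u00e9")  # False
print(unicodedata.normalize("NFC", decomposed) == "r\u00e9sum\u00e9")  # True
```

The last two comparisons illustrate why a denormalized sequence, although valid, can be hard for another user or process to reproduce.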
3.2. Parse the IRI into IRI components
Parse the IRI, either as a relative reference (no scheme) or using
scheme specific processing (according to the scheme given); the
@@ -1535,21 +1538,21 @@
This section discusses details and gives examples for point c) in
Section 1.2. To be able to use IRIs, the URI corresponding to the
IRI in question has to encode original characters into octets by
using UTF-8. This can be specified for all URIs of a URI scheme or
can apply to individual URIs for schemes that do not specify how to
encode original characters. It can apply to the whole URI, or only
to some part. For background information on encoding characters into
URIs, see also Section 2.5 of [RFC3986].
- For new URI schemes, using UTF-8 is recommended in [RFC4395].
+ For new URI schemes, using UTF-8 is recommended in [RFC4395bis].
Examples where UTF-8 is already used are the URN syntax [RFC2141],
IMAP URLs [RFC2192], and POP URLs [RFC2384]. On the other hand,
because the HTTP URI scheme does not specify how to encode original
characters, only some HTTP URLs can have corresponding but different
IRIs.
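Whether a given URI has a corresponding but different IRI can be explored with a sketch like the one below. It is a simplified illustration, not the conversion procedure of Section 3.7: the names uri_to_iri and _decode_run are hypothetical, a run of percent-encoded octets is converted only if the whole run is valid UTF-8, and the sketch does not filter out characters that are not allowed in IRIs.

```python
import re

# One or more consecutive "%XX" triplets.
_PCT_RUN = re.compile(r"(?:%[0-9A-Fa-f]{2})+")


def _decode_run(match: "re.Match[str]") -> str:
    run = match.group(0)
    try:
        text = bytes.fromhex(run.replace("%", "")).decode("utf-8")
    except UnicodeDecodeError:
        return run  # not UTF-8 (e.g. "%E9" alone): keep as written
    out, pos = [], 0
    for ch in text:
        width = 3 * len(ch.encode("utf-8"))  # "%XX" per octet
        # Keep percent-encoded ASCII (such as "%20") exactly as written.
        out.append(run[pos:pos + width] if ord(ch) < 0x80 else ch)
        pos += width
    return "".join(out)


def uri_to_iri(uri: str) -> str:
    """Replace percent-encoded UTF-8 sequences with non-ASCII characters."""
    return _PCT_RUN.sub(_decode_run, uri)


print(uri_to_iri("http://www.example.org/r%C3%A9sum%C3%A9.html"))
# http://www.example.org/résumé.html
print(uri_to_iri("http://www.example.org/r%E9sum%E9.html"))
# http://www.example.org/r%E9sum%E9.html (unchanged: not UTF-8)
```

The second call shows an HTTP URL whose octets were produced from some other encoding; it has no corresponding different IRI under this sketch.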
For example, for a document with a URI of
"http://www.example.org/r%C3%A9sum%C3%A9.html", it is possible to
construct a corresponding IRI (in XML notation, see Section 1.4):
"http://www.example.org/r&#xE9;sum&#xE9;.html" ("&#xE9;" stands for
@@ -2438,23 +2442,25 @@
[RFC2397] Masinter, L., "The "data" URL scheme", RFC 2397,
August 1998.
[RFC2616] Fielding, R., Gettys, J., Mogul, J., Frystyk, H.,
Masinter, L., Leach, P., and T. Berners-Lee, "Hypertext
Transfer Protocol -- HTTP/1.1", RFC 2616, June 1999.
[RFC2640] Curtin, B., "Internationalization of the File Transfer
Protocol", RFC 2640, July 1999.
- [RFC4395] Hansen, T., Hardie, T., and L. Masinter, "Guidelines and
- Registration Procedures for New URI Schemes", BCP 35,
- RFC 4395, February 2006.
+ [RFC4395bis]
+ Hansen, T., Hardie, T., and L. Masinter, "Guidelines and
+ Registration Procedures for New URI/IRI Schemes",
+ draft-hansen-iri-4395bis-irireg-00 (work in progress),
+ September 2010.
[UNIXML] Duerst, M. and A. Freytag, "Unicode in XML and other
Markup Languages", Unicode Technical Report #20, World
Wide Web Consortium Note, June 2003.
[UTR36] Davis, M. and M. Suignard, "Unicode Security
Considerations", Unicode Technical Report #36,
August 2010.