--- 1/draft-ietf-iri-3987bis-02.txt 2010-10-25 11:15:47.000000000 +0200
+++ 2/draft-ietf-iri-3987bis-03.txt 2010-10-25 11:15:47.000000000 +0200
@@ -1,21 +1,21 @@
 Internationalized Resource                                     M. Duerst
 Identifiers (iri)                               Aoyama Gakuin University
 Internet-Draft                                               M. Suignard
 Obsoletes: 3987 (if approved)                         Unicode Consortium
 Intended status: Standards Track                             L. Masinter
-Expires: April 20, 2011                                            Adobe
-                                                        October 17, 2010
+Expires: April 28, 2011                                            Adobe
+                                                        October 25, 2010
 
 
              Internationalized Resource Identifiers (IRIs)
-                       draft-ietf-iri-3987bis-02
+                       draft-ietf-iri-3987bis-03
 
 Abstract
 
    This document defines the Internationalized Resource Identifier (IRI)
    protocol element, as an extension of the Uniform Resource Identifier
    (URI).  An IRI is a sequence of characters from the Universal
    Character Set (Unicode/ISO 10646).  Grammar and processing rules are
    given for IRIs and related syntactic forms.
 
    In addition, this document provides named additional rule sets for
@@ -55,21 +55,21 @@
    and may be updated, replaced, or obsoleted by other documents at any
    time.  It is inappropriate to use Internet-Drafts as reference
    material or to cite them other than as "work in progress."
 
    The list of current Internet-Drafts can be accessed at
    http://www.ietf.org/ietf/1id-abstracts.txt.
 
    The list of Internet-Draft Shadow Directories can be accessed at
    http://www.ietf.org/shadow.html.
 
-   This Internet-Draft will expire on April 20, 2011.
+   This Internet-Draft will expire on April 28, 2011.
 
 Copyright Notice
 
    Copyright (c) 2010 IETF Trust and the persons identified as the
    document authors.  All rights reserved.
 
    This document is subject to BCP 78 and the IETF Trust's Legal
    Provisions Relating to IETF Documents
    (http://trustee.ietf.org/license-info) in effect on the date of
    publication of this document.  Please review these documents
@@ -98,47 +98,47 @@
      1.2.   Applicability . . . . . . . . . . . . . . . . . . . . . .  6
      1.3.   Definitions . . . . . . . . . . . . . . . . . . . . . . .  6
      1.4.   Notation  . . . . . . . . . . . . . . . . . . . . . . . .  9
    2.  IRI Syntax . . . . . . . . . . . . . . . . . . . . . . . . . .  9
      2.1.   Summary of IRI Syntax . . . . . . . . . . . . . . . . . . 10
      2.2.   ABNF for IRI References and IRIs  . . . . . . . . . . . . 10
    3.  Processing IRIs and related protocol elements . . . . . . . . 13
      3.1.   Converting to UCS . . . . . . . . . . . . . . . . . . . . 14
      3.2.   Parse the IRI into IRI components . . . . . . . . . . . . 14
      3.3.   General percent-encoding of IRI components  . . . . . . . 15
-     3.4.   Mapping ireg-name . . . . . . . . . . . . . . . . . . . . 15
+     3.4.   Mapping ireg-name . . . . . . . . . . . . . . . . . . . . 16
      3.5.   Mapping query components  . . . . . . . . . . . . . . . . 17
      3.6.   Mapping IRIs to URIs  . . . . . . . . . . . . . . . . . . 17
      3.7.   Converting URIs to IRIs . . . . . . . . . . . . . . . . . 17
        3.7.1.  Examples . . . . . . . . . . . . . . . . . . . . . . . 19
    4.  Bidirectional IRIs for Right-to-Left Languages . . . . . . . . 20
      4.1.   Logical Storage and Visual Presentation . . . . . . . . . 21
      4.2.   Bidi IRI Structure  . . . . . . . . . . . . . . . . . . . 22
      4.3.   Input of Bidi IRIs  . . . . . . . . . . . . . . . . . . . 23
      4.4.   Examples  . . . . . . . . . . . . . . . . . . . . . . . . 23
    5.  Normalization and Comparison . . . . . . . . . . . . . . . . . 25
-     5.1.   Equivalence . . . . . . . . . . . . . . . . . . . . . . . 25
+     5.1.   Equivalence . . . . . . . . . . . . . . . . . . . . . . . 26
      5.2.   Preparation for Comparison  . . . . . . . . . . . . . . . 26
      5.3.   Comparison Ladder . . . . . . . . . . . . . . . . . . . . 27
        5.3.1.  Simple String Comparison . . . . . . . . . . . . . . . 27
        5.3.2.  Syntax-Based Normalization . . . . . . . . . . . . . . 28
        5.3.3.  Scheme-Based Normalization . . . . . . . . . . . . . . 31
        5.3.4.  Protocol-Based Normalization . . . . . . . . . . . . . 32
-   6.  Use of IRIs  . . . . . . . . . . . . . . . . . . . . . . . . . 32
+   6.  Use of IRIs  . . . . . . . . . . . . . . . . . . . . . . . . . 33
      6.1.   Limitations on UCS Characters Allowed in IRIs . . . . . . 33
      6.2.   Software Interfaces and Protocols . . . . . . . . . . . . 33
-     6.3.   Format of URIs and IRIs in Documents and Protocols  . . . 33
+     6.3.   Format of URIs and IRIs in Documents and Protocols  . . . 34
      6.4.   Use of UTF-8 for Encoding Original Characters . . . . . . 34
      6.5.   Relative IRI References . . . . . . . . . . . . . . . . . 36
    7.  Liberal handling of otherwise invalid IRIs . . . . . . . . . . 36
      7.1.   LEIRI processing  . . . . . . . . . . . . . . . . . . . . 36
-     7.2.   Web Address processing  . . . . . . . . . . . . . . . . . 36
+     7.2.   Web Address processing  . . . . . . . . . . . . . . . . . 37
      7.3.   Characters not allowed in IRIs  . . . . . . . . . . . . . 38
    8.  URI/IRI Processing Guidelines (Informative)  . . . . . . . . . 40
      8.1.   URI/IRI Software Interfaces . . . . . . . . . . . . . . . 40
      8.2.   URI/IRI Entry . . . . . . . . . . . . . . . . . . . . . . 41
      8.3.   URI/IRI Transfer between Applications . . . . . . . . . . 42
      8.4.   URI/IRI Generation  . . . . . . . . . . . . . . . . . . . 42
      8.5.   URI/IRI Selection . . . . . . . . . . . . . . . . . . . . 43
      8.6.   Display of URIs/IRIs  . . . . . . . . . . . . . . . . . . 43
      8.7.   Interpretation of URIs and IRIs . . . . . . . . . . . . . 44
      8.8.   Upgrading Strategy  . . . . . . . . . . . . . . . . . . . 44
@@ -215,21 +215,21 @@
    discusses various forms of equivalence between IRIs.  Section 6
    discusses the use of IRIs in different situations.  Section 8 gives
    additional informative guidelines.  Section 10 discusses IRI-specific
    security considerations.
 
 1.2.  Applicability
 
    IRIs are designed to allow protocols and software that deal with URIs
    to be updated to handle IRIs.  A "URI scheme" (as defined by
    [RFC3986] and registered through the IANA process defined in
-   [RFC4395] also serves as an "IRI scheme".  Processing of IRIs is
+   [RFC4395bis] also serves as an "IRI scheme".  Processing of IRIs is
    accomplished by extending the URI syntax while retaining (and not
    expanding) the set of "reserved" characters, such that the syntax for
    any URI scheme may be uniformly extended to allow non-ASCII
    characters.  In addition, following parsing of an IRI, it is possible
    to construct a corresponding URI by first encoding characters outside
    of the allowed URI range and then reassembling the components.
 
    Practical use of IRIs forms in place of URIs forms depends on the
    following conditions being met:
 
@@ -245,21 +245,21 @@
 
    b. The protocol or format carrying the IRIs MUST have a mechanism to
       represent the wide range of characters used in IRIs, either
       natively or by some protocol- or format-specific escaping
       mechanism (for example, numeric character references in [XML1]).
 
    c. The URI scheme definition, if it explicitly allows a percent sign
       ("%") in any syntactic component, SHOULD define the interpretation
       of sequences of percent-encoded octets (using "%XX" hex octets) as
       octet from sequences of UTF-8 encoded strings; this is recommended
-      in the guidelines for registering new schemes, [RFC4395].  For
+      in the guidelines for registering new schemes, [RFC4395bis].  For
       example, this is the practice for IMAP URLs [RFC2192], POP URLs
       [RFC2384] and the URN syntax [RFC2141]).  Note that use of
       percent-encoding may also be restricted in some situations, for
       example, URI schemes that disallow percent-encoding might still be
       used with a fragment identifier which is percent-encoded (e.g.,
       [XPointer]).  See Section 6.4 for further discussion.
 
 1.3.  Definitions
 
    The following definitions are used in this document; they follow the
@@ -575,27 +575,33 @@
    and IRI references (i.e., absolute or relative forms); for IRIs, some
    steps are scheme specific.
 
 3.1.  Converting to UCS
 
    Input that is already in a Unicode form (i.e., a sequence of Unicode
    characters or an octet-stream representing a Unicode-based character
    encoding such as UTF-8 or UTF-16) should be left as is and not
    normalized (see (see Section 5.3.2.2).
 
-   If the IRI or IRI reference is an octet stream in some known non-
-   Unicode character encoding, convert the IRI to a sequence of
-   characters from the UCS; this sequence SHOULD also be normalized
-   according to Unicode Normalization Form C (NFC, [UTR15]).  In this
-   case, retain the original character encoding as the "document
-   character encoding".  (DESIGN QUESTION: NOT WHAT MOST IMPLEMENTATIONS
-   DO, CHANGE? )
+   An IRI or IRI reference is a sequence of characters from the UCS.
+   For IRIs that are not already in a Unicode form (as when written on
+   paper, read aloud, or represented in a text stream using a legacy
+   character encoding), convert the IRI to Unicode.  Note that some
+   character encodings or transcriptions can be converted to or
+   represented by more than one sequence of Unicode characters.  Ideally
+   the resulting IRI would use a normalized form, such as Unicode
+   Normalization Form C [UTR15] (see Section 5.3 Normalization and
+   Comparison), since that ensures a stable, consistent representation
+   that is most likely to produce the intended results.  Implementers
+   and users are cautioned that, while denormalized character sequences
+   are valid, they might be difficult for other users or processes to
+   reproduce and might lead to unexpected results.
 
    In other cases (written on paper, read aloud, or otherwise
    represented independent of any character encoding) represent the IRI
    as a sequence of characters from the UCS normalized according to
    Unicode Normalization Form C (NFC, [UTR15]).
 
 3.2.  Parse the IRI into IRI components
 
    Parse the IRI, either as a relative reference (no scheme) or using
    scheme specific processing (according to the scheme given); the
@@ -1535,21 +1538,21 @@
 
    This section discusses details and gives examples for point c) in
    Section 1.2.  To be able to use IRIs, the URI corresponding to the
    IRI in question has to encode original characters into octets by
    using UTF-8.  This can be specified for all URIs of a URI scheme or
    can apply to individual URIs for schemes that do not specify how to
    encode original characters.  It can apply to the whole URI, or only
    to some part.  For background information on encoding characters into
    URIs, see also Section 2.5 of [RFC3986].
 
-   For new URI schemes, using UTF-8 is recommended in [RFC4395].
+   For new URI schemes, using UTF-8 is recommended in [RFC4395bis].
    Examples where UTF-8 is already used are the URN syntax [RFC2141],
    IMAP URLs [RFC2192], and POP URLs [RFC2384].  On the other hand,
    because the HTTP URI scheme does not specify how to encode original
    characters, only some HTTP URLs can have corresponding but different
    IRIs.
 
    For example, for a document with a URI of
    "http://www.example.org/r%C3%A9sum%C3%A9.html", it is possible to
    construct a corresponding IRI (in XML notation, see Section 1.4):
    "http://www.example.org/résumé.html" ("é" stands for
@@ -2438,23 +2442,25 @@
    [RFC2397]  Masinter, L., "The "data" URL scheme", RFC 2397,
               August 1998.
 
    [RFC2616]  Fielding, R., Gettys, J., Mogul, J., Frystyk, H.,
               Masinter, L., Leach, P., and T. Berners-Lee, "Hypertext
               Transfer Protocol -- HTTP/1.1", RFC 2616, June 1999.
 
    [RFC2640]  Curtin, B., "Internationalization of the File Transfer
               Protocol", RFC 2640, July 1999.
 
-   [RFC4395]  Hansen, T., Hardie, T., and L. Masinter, "Guidelines and
-              Registration Procedures for New URI Schemes", BCP 35,
-              RFC 4395, February 2006.
+   [RFC4395bis]
+              Hansen, T., Hardie, T., and L. Masinter, "Guidelines and
+              Registration Procedures for New URI/IRI Schemes",
+              draft-hansen-iri-4395bis-irireg-00 (work in progress),
+              September 2010.
 
    [UNIXML]   Duerst, M. and A. Freytag, "Unicode in XML and other
               Markup Languages", Unicode Technical Report #20, World
               Wide Web Consortium Note, June 2003, .
 
    [UTR36]    Davis, M. and M. Suignard, "Unicode Security
               Considerations", Unicode Technical Report #36,
               August 2010, .
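
Illustrative note (not part of either draft revision): the revised
Section 3.1 text and the Section 6.4 example quoted in the hunks above
can be sketched in Python roughly as follows.  This is a minimal sketch
under stated assumptions, not the drafts' algorithm: the function names
are hypothetical, NFC is applied unconditionally even though the -03
text only describes it as ideal, and the percent-encoding helper treats
its input as a single path segment rather than a parsed IRI.

```python
import unicodedata
from urllib.parse import quote, unquote


def to_nfc_iri(text: str) -> str:
    """Return the text in Unicode Normalization Form C.

    Hypothetical helper: the -03 wording calls a normalized form ideal
    for IRIs converted from a non-Unicode source; it does not require it.
    """
    return unicodedata.normalize("NFC", text)


def percent_encode_segment(segment: str) -> str:
    """Percent-encode one path segment using UTF-8 octets.

    Assumption: the caller has already split the IRI into components.
    """
    return quote(segment, safe="", encoding="utf-8")


def percent_decode_segment(segment: str) -> str:
    """Interpret %XX sequences in one segment as UTF-8 octets."""
    return unquote(segment, encoding="utf-8")


if __name__ == "__main__":
    # "e" followed by U+0301 (combining acute) is a denormalized spelling.
    decomposed = "http://www.example.org/re\u0301sume\u0301.html"
    print(to_nfc_iri(decomposed) == "http://www.example.org/r\u00e9sum\u00e9.html")
    print(percent_encode_segment("r\u00e9sum\u00e9.html"))  # r%C3%A9sum%C3%A9.html
    print(percent_decode_segment("r%C3%A9sum%C3%A9.html"))
```

The decomposed input shows why the -03 text cautions about denormalized
sequences: both spellings display identically, but only the NFC form
maps to the "%C3%A9" octets used in the Section 6.4 example.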