--- 1/draft-ietf-iri-3987bis-09.txt 2012-03-02 13:13:58.018671504 +0100 +++ 2/draft-ietf-iri-3987bis-10.txt 2012-03-02 13:13:58.094671145 +0100 @@ -1,21 +1,21 @@ Internationalized Resource Identifiers M. Duerst (iri) Aoyama Gakuin University Internet-Draft M. Suignard Obsoletes: 3987 (if approved) Unicode Consortium Intended status: Standards Track L. Masinter -Expires: July 12, 2012 Adobe - January 9, 2012 +Expires: September 3, 2012 Adobe + March 2, 2012 Internationalized Resource Identifiers (IRIs) - draft-ietf-iri-3987bis-09 + draft-ietf-iri-3987bis-10 Abstract This document defines the Internationalized Resource Identifier (IRI) protocol element, as an extension of the Uniform Resource Identifier (URI). An IRI is a sequence of characters from the Universal Character Set (Unicode/ISO 10646). Grammar and processing rules are given for IRIs and related syntactic forms. Defining IRI as new protocol element (rather than updating or @@ -50,21 +50,21 @@ Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet- Drafts is at http://datatracker.ietf.org/drafts/current/. Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress." - This Internet-Draft will expire on July 12, 2012. + This Internet-Draft will expire on September 3, 2012. Copyright Notice Copyright (c) 2012 IETF Trust and the persons identified as the document authors. All rights reserved. This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents @@ -95,75 +95,75 @@ 1.4. Notation . . . . . . . . . . . . . . . . . 
. . . . . . . 8 2. IRI Syntax . . . . . . . . . . . . . . . . . . . . . . . . . . 9 2.1. Summary of IRI Syntax . . . . . . . . . . . . . . . . . . 9 2.2. ABNF for IRI References and IRIs . . . . . . . . . . . . 10 3. Processing IRIs and related protocol elements . . . . . . . . 13 3.1. Converting to UCS . . . . . . . . . . . . . . . . . . . . 13 3.2. Parse the IRI into IRI components . . . . . . . . . . . . 13 3.3. General percent-encoding of IRI components . . . . . . . 14 3.4. Mapping ireg-name . . . . . . . . . . . . . . . . . . . . 14 3.4.1. Mapping using Percent-Encoding . . . . . . . . . . . . 14 - 3.4.2. Mapping using Punycode . . . . . . . . . . . . . . . . 14 + 3.4.2. Mapping using Punycode . . . . . . . . . . . . . . . . 15 3.4.3. Additional Considerations . . . . . . . . . . . . . . 15 3.5. Mapping query components . . . . . . . . . . . . . . . . 16 3.6. Mapping IRIs to URIs . . . . . . . . . . . . . . . . . . 16 4. Converting URIs to IRIs . . . . . . . . . . . . . . . . . . . 16 4.1. Examples . . . . . . . . . . . . . . . . . . . . . . . . 18 5. Use of IRIs . . . . . . . . . . . . . . . . . . . . . . . . . 19 5.1. Limitations on UCS Characters Allowed in IRIs . . . . . . 19 5.2. Software Interfaces and Protocols . . . . . . . . . . . . 20 - 5.3. Format of URIs and IRIs in Documents and Protocols . . . 20 - 5.4. Use of UTF-8 for Encoding Original Characters . . . . . . 20 - 5.5. Relative IRI References . . . . . . . . . . . . . . . . . 22 - 6. Legacy Extended IRIs (LEIRIs) . . . . . . . . . . . . . . . . 22 + 5.3. Format of URIs and IRIs in Documents and Protocols . . . 21 + 5.4. Use of UTF-8 for Encoding Original Characters . . . . . . 21 + 5.5. Relative IRI References . . . . . . . . . . . . . . . . . 23 + 6. Legacy Extended IRIs (LEIRIs) . . . . . . . . . . . . . . . . 23 6.1. Legacy Extended IRI Syntax . . . . . . . . . . . . . . . 23 - 6.2. Conversion of Legacy Extended IRIs to IRIs . . . . . . . 23 + 6.2. 
Conversion of Legacy Extended IRIs to IRIs . . . . . . . 24 6.3. Characters Allowed in Legacy Extended IRIs but not in - IRIs . . . . . . . . . . . . . . . . . . . . . . . . . . 23 - 7. URI/IRI Processing Guidelines (Informative) . . . . . . . . . 25 - 7.1. URI/IRI Software Interfaces . . . . . . . . . . . . . . . 25 + IRIs . . . . . . . . . . . . . . . . . . . . . . . . . . 24 + 7. URI/IRI Processing Guidelines (Informative) . . . . . . . . . 26 + 7.1. URI/IRI Software Interfaces . . . . . . . . . . . . . . . 26 7.2. URI/IRI Entry . . . . . . . . . . . . . . . . . . . . . . 26 - 7.3. URI/IRI Transfer between Applications . . . . . . . . . . 26 + 7.3. URI/IRI Transfer between Applications . . . . . . . . . . 27 7.4. URI/IRI Generation . . . . . . . . . . . . . . . . . . . 27 - 7.5. URI/IRI Selection . . . . . . . . . . . . . . . . . . . . 27 - 7.6. Display of URIs/IRIs . . . . . . . . . . . . . . . . . . 28 - 7.7. Interpretation of URIs and IRIs . . . . . . . . . . . . . 28 - 7.8. Upgrading Strategy . . . . . . . . . . . . . . . . . . . 29 - 8. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 30 - 9. Security Considerations . . . . . . . . . . . . . . . . . . . 30 - 10. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 31 - 11. Main Changes Since RFC 3987 . . . . . . . . . . . . . . . . . 32 + 7.5. URI/IRI Selection . . . . . . . . . . . . . . . . . . . . 28 + 7.6. Display of URIs/IRIs . . . . . . . . . . . . . . . . . . 29 + 7.7. Interpretation of URIs and IRIs . . . . . . . . . . . . . 29 + 7.8. Upgrading Strategy . . . . . . . . . . . . . . . . . . . 30 + 8. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 31 + 9. Security Considerations . . . . . . . . . . . . . . . . . . . 31 + 10. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 32 + 11. Main Changes Since RFC 3987 . . . . . . . . . . . . . . . . . 33 11.1. Split out Bidi, processing guidelines, comparison - sections . . . . . . . . . . . . . . . . . . . . 
. . . . 32 + sections . . . . . . . . . . . . . . . . . . . . . . . . 33 - 11.2. Major restructuring of IRI processing model . . . . . . . 32 - 11.2.1. OLD WAY . . . . . . . . . . . . . . . . . . . . . . . 32 - 11.2.2. NEW WAY . . . . . . . . . . . . . . . . . . . . . . . 33 - 11.2.3. Extension of Syntax . . . . . . . . . . . . . . . . . 33 - 11.2.4. More to be added . . . . . . . . . . . . . . . . . . . 33 - 11.3. Change Log . . . . . . . . . . . . . . . . . . . . . . . 33 - 11.3.1. Changes after draft-ietf-iri-3987bis-01 . . . . . . . 33 + 11.2. Major restructuring of IRI processing model . . . . . . . 33 + 11.2.1. OLD WAY . . . . . . . . . . . . . . . . . . . . . . . 33 + 11.2.2. NEW WAY . . . . . . . . . . . . . . . . . . . . . . . 34 + 11.2.3. Extension of Syntax . . . . . . . . . . . . . . . . . 34 + 11.2.4. More to be added . . . . . . . . . . . . . . . . . . . 34 + 11.3. Change Log . . . . . . . . . . . . . . . . . . . . . . . 34 + 11.3.1. Changes after draft-ietf-iri-3987bis-01 . . . . . . . 34 11.3.2. Changes from draft-duerst-iri-bis-07 to draft-ietf-iri-3987bis-00 . . . . . . . . . . . . . . 34 11.3.3. Changes from -06 to -07 of draft-duerst-iri-bis . . . 34 - 11.4. Changes from -00 to -01 . . . . . . . . . . . . . . . . . 34 - 11.5. Changes from -05 to -06 of draft-duerst-iri-bis-00 . . . 34 - 11.6. Changes from -04 to -05 of draft-duerst-iri-bis . . . . . 34 - 11.7. Changes from -03 to -04 of draft-duerst-iri-bis . . . . . 34 + 11.4. Changes from -00 to -01 . . . . . . . . . . . . . . . . . 35 + 11.5. Changes from -05 to -06 of draft-duerst-iri-bis-00 . . . 35 + 11.6. Changes from -04 to -05 of draft-duerst-iri-bis . . . . . 35 + 11.7. Changes from -03 to -04 of draft-duerst-iri-bis . . . . . 35 11.8. Changes from -02 to -03 of draft-duerst-iri-bis . . . . . 35 11.9. Changes from -01 to -02 of draft-duerst-iri-bis . . . . . 35 - 11.10. Changes from -00 to -01 of draft-duerst-iri-bis . . . . . 35 - 11.11. 
Changes from RFC 3987 to -00 of draft-duerst-iri-bis . . 35 - 12. References . . . . . . . . . . . . . . . . . . . . . . . . . . 35 - 12.1. Normative References . . . . . . . . . . . . . . . . . . 35 - 12.2. Informative References . . . . . . . . . . . . . . . . . 36 - Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 39 + 11.10. Changes from -00 to -01 of draft-duerst-iri-bis . . . . . 36 + 11.11. Changes from RFC 3987 to -00 of draft-duerst-iri-bis . . 36 + 12. References . . . . . . . . . . . . . . . . . . . . . . . . . . 36 + 12.1. Normative References . . . . . . . . . . . . . . . . . . 36 + 12.2. Informative References . . . . . . . . . . . . . . . . . 37 + Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 40 1. Introduction 1.1. Overview and Motivation A Uniform Resource Identifier (URI) is defined in [RFC3986] as a sequence of characters chosen from a limited subset of the repertoire of US-ASCII [ASCII] characters. The characters in URIs are frequently used for representing words of @@ -460,21 +460,21 @@ ipath = ipath-abempty ; begins with "/" or is empty / ipath-absolute ; begins with "/" but not "//" / ipath-noscheme ; begins with a non-colon segment / ipath-rootless ; begins with a segment / ipath-empty ; zero characters ipath-abempty = *( path-sep isegment ) ipath-absolute = path-sep [ isegment-nz *( path-sep isegment ) ] ipath-noscheme = isegment-nz-nc *( path-sep isegment ) ipath-rootless = isegment-nz *( path-sep isegment ) - ipath-empty = 0 + ipath-empty = "" path-sep = "/" isegment = *ipchar isegment-nz = 1*ipchar isegment-nz-nc = 1*( iunreserved / pct-form / sub-delims / "@" ) ; non-zero-length segment without any colon ":" ipchar = iunreserved / pct-form / sub-delims / ":" / "@" @@ -606,38 +606,44 @@ is the hexadecimal notation of the octet value. The hexadecimal notation SHOULD use uppercase letters. (This is the general URI percent-encoding mechanism in Section 2.1 of [RFC3986].) 
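The percent-encoding step described above can be sketched in Python. This is a minimal illustration and not part of the draft: the function name percent_encode_non_ascii is hypothetical, and the sketch makes the simplifying assumption that only non-ASCII characters need encoding, leaving every ASCII character (including existing "%HH" triplets) untouched.

```python
def percent_encode_non_ascii(component: str) -> str:
    """Percent-encode the non-ASCII characters of an IRI component.

    Hypothetical helper for illustration only. Simplifying assumption:
    ASCII characters are copied through unchanged; every other character
    is converted to its UTF-8 octets, each written as %HH in uppercase hex.
    """
    out = []
    for ch in component:
        if ord(ch) < 0x80:
            # ASCII: copied as-is in this sketch
            out.append(ch)
        else:
            # Non-ASCII: one %HH triplet per UTF-8 octet, uppercase hex digits
            out.extend(f"%{octet:02X}" for octet in ch.encode("utf-8"))
    return "".join(out)


if __name__ == "__main__":
    # U+00E9 (LATIN SMALL LETTER E WITH ACUTE) is the octets C3 A9 in UTF-8
    print(percent_encode_non_ascii("r\u00e9sum\u00e9.example.org"))
```

Because the output consists only of ASCII characters, applying the function to its own output returns that output unchanged, which is consistent with the mapping being idempotent.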
Note that the mapping is an identity transformation for parsed URI components of valid URIs, and is idempotent: applying the mapping a second time will not change anything. 3.4. Mapping ireg-name + The mapping from <ireg-name> to a <reg-name> requires a choice + between one of the two methods described below. + 3.4.1. Mapping using Percent-Encoding The ireg-name component SHOULD be converted according to the general procedure for percent-encoding of IRI components described in Section 3.3. For example, the IRI "http://résumé.example.org" will be converted to "http://r%C3%A9sum%C3%A9.example.org". This conversion for ireg-name is in line with Section 3.2.2 of [RFC3986], which does not mandate a particular registered name lookup technology. For further background, see [RFC6055] and [Gettys]. 3.4.2. Mapping using Punycode - The ireg-name component MAY also be converted as follows: + In situations where it is certain that <ireg-name> is intended to be + used as a domain name to be processed by Domain Name Lookup (as per + [RFC5891]), an alternative method MAY be used, converting <ireg-name> + as follows: If there are any sequences of <pct-encoded>, and their corresponding octets all represent valid UTF-8 octet sequences, then convert these back to Unicode character sequences. (If any sequences are not valid UTF-8 octet sequences, then leave the entire field as is without any change, since punycode encoding would not succeed.) Replace the ireg-name part of the IRI by the part converted using the Domain Name Lookup procedure (Subsections 5.3 to 5.5) of [RFC5891] on each dot-separated label, and by using U+002E (FULL STOP) as a @@ -865,20 +870,34 @@ cannot and should not check for such limitations.) b. The UCS contains many areas of characters for which there are strong visual look-alikes. Because of the likelihood of transcription errors, these also should be avoided. This includes the full-width equivalents of Latin characters, half-width Katakana characters for Japanese, and many others.
It also includes many look-alikes of "space", "delims", and "unwise", characters excluded in [RFC3491]. + c. At the start of a component, the use of combining marks is + strongly discouraged. As an example, a COMBINING TILDE OVERLAY + (U+0334) would be very confusing at the start of a <isegment>. + Combined with the preceding '/', it might look like a solidus + with combining tilde overlay, but IRI processing software will + parse and process the '/' separately. + + d. The ZERO WIDTH NON-JOINER (U+200C) and ZERO WIDTH JOINER (U+200D) + are invisible in most contexts, but are crucial in some very + limited contexts. Appendix A of [RFC5892] contains contextual + restrictions for these and some other characters. The use of + these characters is strongly discouraged except in the relevant + contexts. + Additional information is available from [UNIXML]. [UNIXML] is written in the context of running text rather than in that of identifiers. Nevertheless, it discusses many of the categories of characters not appropriate for IRIs. 5.2. Software Interfaces and Protocols Although an IRI is defined as a sequence of characters, software interfaces for URIs typically function on sequences of octets or other kinds of code units. Thus, software interfaces and protocols @@ -994,90 +1013,87 @@ 5.5. Relative IRI References Processing of relative IRI references against a base is handled straightforwardly; the algorithms of [RFC3986] can be applied directly, treating the characters additionally allowed in IRI references in the same way that unreserved characters are in URI references. 6. Legacy Extended IRIs (LEIRIs) - For historic reasons, some formats have allowed variants of IRIs that - are somewhat less restricted in syntax. This section provides a - definition and a name (Legacy Extended IRI or LEIRI) for these - variants for easier reference. These variants have to be used with - care; they require further processing before being fully - interchangeable as IRIs.
New protocols and formats SHOULD NOT use - Legacy Extended IRIs. Even where Legacy Extended IRIs are allowed, - only IRIs fully conforming to the syntax definition in Section 2.2 - SHOULD be created, generated, and used. The provisions in this - section also apply to Legacy Extended IRI references. + In some cases, there have been formats which have used a protocol + element which is a variant of the IRI definition; these variants have + usually been somewhat less restricted in syntax. This section + provides a definition and a name (Legacy Extended IRI or LEIRI) for + one of these variants used widely in XML-based protocols. This + variant has to be used with care; it requires further processing + before being fully interchangeable as IRIs. New protocols and + formats SHOULD NOT use Legacy Extended IRIs. Even where Legacy + Extended IRIs are allowed, only IRIs fully conforming to the syntax + definition in Section 2.2 SHOULD be created, generated, and used. + The provisions in this section also apply to Legacy Extended IRI + references. 6.1. Legacy Extended IRI Syntax - The syntax of Legacy Extended IRIs is the same as that for IRIs, - except that ucschar is redefined as follows: + This section defines Legacy Extended IRIs (LEIRIs). The syntax of + Legacy Extended IRIs is the same as that for , except + that the ucschar production is replaced by the leiri-ucschar + production: - ucschar = " " / "<" / ">" / '"' / "{" / "}" / "|" + leiri-ucschar = " " / "<" / ">" / '"' / "{" / "}" / "|" / "\" / "^" / "`" / %x0-1F / %x7F-D7FF / %xE000-FFFD / %x10000-10FFFF The restriction on bidirectional formatting characters in [Bidi] is lifted. The iprivate production becomes redundant. Likewise, the syntax for Legacy Extended IRI references (LEIRI references) is the same as that for IRI references with the above - redefinition of ucschar applied. 
- - Formats that use Legacy Extended IRIs or Legacy Extended IRI - references MAY further restrict the characters allowed therein, - either implicitly by the fact that the format as such does not allow - some characters, or explicitly. An example of a character not - allowed implicitly may be the NUL character (U+0000). However, all - the characters allowed in IRIs MUST still be allowed. + replacement of ucschar with leiri-ucschar. 6.2. Conversion of Legacy Extended IRIs to IRIs To convert a Legacy Extended IRI (reference) to an IRI (reference), each character allowed in a Legacy Extended IRI (reference) but not allowed in an IRI (reference) (see Section 6.3) MUST be percent- - encoded by applying steps 2.1 to 2.3 of Section 3.6. + encoded by applying the steps in Section 3.3. 6.3. Characters Allowed in Legacy Extended IRIs but not in IRIs This section provides a list of the groups of characters and code points that are allowed in Legacy Extended IRIs, but are not allowed in IRIs or are allowed in IRIs only in the query part. For each group of characters, advice on the usage of these characters is also given, concentrating on the reasons not to use them. Space (U+0020): Some formats and applications use space as a - delimiter, e.g. for items in a list. Appendix C of [RFC3986] also - mentions that white space may have to be added when displaying or - printing long URIs; the same applies to long IRIs. This means - that spaces can disappear, or can make the Legacy Extended IRI to - be interpreted as two or more separate IRIs. + delimiter, e.g., for items in a list. Appendix C of [RFC3986] + also mentions that white space may have to be added when + displaying or printing long URIs; the same applies to long IRIs. + Spaces might disappear, or a single Legacy Extended IRI might + incorrectly be interpreted as two or more separate ones.
Delimiters "<" (U+003C), ">" (U+003E), and '"' (U+0022): Appendix C of [RFC3986] suggests the use of double-quotes ("http://example.com/") and angle brackets (<http://example.com/>) as delimiters for URIs in plain text. These conventions are often used, and also apply to IRIs. Legacy Extended IRIs using these - characters will be cut off at the wrong place. + characters might be cut off at the wrong place. Unwise characters "\" (U+005C), "^" (U+005E), "`" (U+0060), "{" (U+007B), "|" (U+007C), and "}" (U+007D): These characters - originally have been excluded from URIs because the respective + originally were excluded from URIs because the respective codepoints are assigned to different graphic characters in some 7-bit or 8-bit encoding. Despite the move to Unicode, some of these characters are still occasionally displayed differently on - some systems, e.g. U+005C as a Japanese Yen symbol. Also, the + some systems, e.g., U+005C as a Japanese Yen symbol. Also, the fact that these characters are not used in URIs or IRIs has encouraged their use outside URIs or IRIs in contexts that may include URIs or IRIs. In case a Legacy Extended IRI with such a character is used in such a context, the Legacy Extended IRI will be interpreted piecemeal. The controls (C0 controls, DEL, and C1 controls, #x0 - #x1F #x7F - #x9F): There is no way to transmit these characters reliably except potentially in electronic form. Even when in electronic form, some software components might silently filter out some of @@ -1104,37 +1120,40 @@ Private use code points (U+E000-F8FF, U+F0000-FFFFD, U+100000- 10FFFD): Display and interpretation of these code points is by definition undefined without private agreement. Therefore, these code points are not suited for use on the Internet. They are not interoperable and may have unpredictable effects. Tags (U+E0000-E0FFF): These characters provide a way to language tag in Unicode plain text.
They are not appropriate for Legacy Extended IRIs because language information in identifiers cannot - reliably be input, transmitted (e.g. on a visual medium such as + reliably be input, transmitted (e.g., on a visual medium such as paper), or recognized. Non-characters (U+FDD0-FDEF, U+1FFFE-1FFFF, U+2FFFE-2FFFF, U+3FFFE-3FFFF, U+4FFFE-4FFFF, U+5FFFE-5FFFF, U+6FFFE-6FFFF, U+7FFFE-7FFFF, U+8FFFE-8FFFF, U+9FFFE-9FFFF, U+AFFFE-AFFFF, U+BFFFE-BFFFF, U+CFFFE-CFFFF, U+DFFFE-DFFFF, U+EFFFE-EFFFF, U+FFFFE-FFFFF, U+10FFFE-10FFFF): These code points are defined as non-characters. Applications may use some of them internally, but are not prepared to interchange them. For reference, we here also list the code points and code units not even allowed in Legacy Extended IRIs: Surrogate code units (D800-DFFF): These do not represent Unicode codepoints. + Non-characters (U+FFFE-FFFF): These are not allowed in XML nor + LEIRIs. + 7. URI/IRI Processing Guidelines (Informative) This informative section provides guidelines for supporting IRIs in the same software components and operations that currently process URIs: Software interfaces that handle URIs, software that allows users to enter URIs, software that creates or generates URIs, software that displays URIs, formats and protocols that transport URIs, and software that interprets URIs. These may all require modification before functioning properly with IRIs. The considerations in this section also apply to URI references and IRI @@ -1370,34 +1389,44 @@ the character encoding) and will therefore be compatible with IRIs. These recommendations, when taken together, will allow for the extension from URIs to IRIs in order to handle characters other than US-ASCII while minimizing interoperability problems. For considerations regarding the upgrade of URI scheme definitions, see Section 5.4. 8. IANA Considerations + NOTE: THIS SECTION NEEDS REVIEW AGAINST HAPPIANA WORK. 
+ RFC Editor and IANA note: Please Replace RFC XXXX with the number of - this document when it issues as an RFC. + this document when it issues as an RFC, and RFC YYYY with the number + of the RFC issued for draft-ietf-iri-rfc3987bis. - IANA maintains a registry of "URI schemes". A "URI scheme" also - serves an "IRI scheme". + IANA maintains a registry of "URI schemes". This document attempts + to make it clear from the registry that a "URI scheme" also serves an + "IRI scheme", and makes several changes to the registry. - To clarify that the URI scheme registration process also applies to - IRIs, change the description of the "URI schemes" registry header to - say "[RFC4395] defines an IANA-maintained registry of URI Schemes. - These registries include the Permanent and Provisional URI Schemes. - RFC XXXX updates this registry to designate that schemes may also - indicate their usability as IRI schemes. + The description of the registry should be changed: "RFC 4395 defined + an IANA-maintained registry of URI Schemes. RFC XXXX updates this + registry to make it clear that the registered values also serve as + IRI schemes, as defined in RFC YYYY." - Update "per RFC 4395" to "per RFC 4395 and RFC XXXX". + The registry includes schemes marked as Permanent or Provisional. + Previously, this was accomplished by having two sections, "Permanent" + and "Provisional". However, in order to allow other status + ("Historical", and possibly a Proposed status for proposals which + have been received but not accepted), the registry should be changed + so that the status is indicated in a separate "Status" column, whose + values may be "Permanent", "Provisional" or "Historical". Changes in + status as well as updates to the entire registration may be + accomplished by requests and expert review. 9. Security Considerations The security considerations discussed in [RFC3986] also apply to IRIs. In addition, the following issues require particular care for IRIs. 
Incorrect encoding or decoding can lead to security problems. For example, some UTF-8 decoders do not check against overlong byte sequences. See [UTR36] Section 3 for details. @@ -1642,20 +1671,24 @@ Resource Identifier (URI): Generic Syntax", STD 66, RFC 3986, January 2005. [RFC5890] Klensin, J., "Internationalized Domain Names for Applications (IDNA): Definitions and Document Framework", RFC 5890, August 2010. [RFC5891] Klensin, J., "Internationalized Domain Names in Applications (IDNA): Protocol", RFC 5891, August 2010. + [RFC5892] Faltstrom, P., "The Unicode Code Points and + Internationalized Domain Names for Applications (IDNA)", + RFC 5892, August 2010. + [STD68] Crocker, D. and P. Overell, "Augmented BNF for Syntax Specifications: ABNF", STD 68, RFC 5234, January 2008. [UNIV6] The Unicode Consortium, "The Unicode Standard, Version 6.0.0 (Mountain View, CA, The Unicode Consortium, 2011, ISBN 978-1-936213-01-6)", October 2010. [UTR15] Davis, M. and M. Duerst, "Unicode Normalization Forms", Unicode Standard Annex #15, March 2008,