--- 1/draft-ietf-iri-3987bis-04.txt 2011-03-30 07:16:23.000000000 +0200 +++ 2/draft-ietf-iri-3987bis-05.txt 2011-03-30 07:16:23.000000000 +0200 @@ -1,21 +1,21 @@ Internationalized Resource M. Duerst Identifiers (iri) Aoyama Gakuin University Internet-Draft M. Suignard Obsoletes: 3987 (if approved) Unicode Consortium Intended status: Standards Track L. Masinter -Expires: September 15, 2011 Adobe - March 14, 2011 +Expires: September 30, 2011 Adobe + March 29, 2011 Internationalized Resource Identifiers (IRIs) - draft-ietf-iri-3987bis-04 + draft-ietf-iri-3987bis-05 Abstract This document defines the Internationalized Resource Identifier (IRI) protocol element, as an extension of the Uniform Resource Identifier (URI). An IRI is a sequence of characters from the Universal Character Set (Unicode/ISO 10646). Grammar and processing rules are given for IRIs and related syntactic forms. In addition, this document provides named additional rule sets for @@ -33,20 +33,22 @@ related protocol elements when revising protocols, formats, and software components that currently deal only with URIs. RFC Editor: Please remove the next paragraph before publication. This document is intended to update RFC 3987 and move towards IETF Draft Standard. For discussion and comments on this draft, please join the IETF IRI WG by subscribing to the mailing list public-iri@w3.org. For a list of open issues, please see the issue tracker of the WG at http://trac.tools.ietf.org/wg/iri/trac/report/1. + For a list of individual edits, please see the change history at + http://trac.tools.ietf.org/wg/iri/trac/log/draft-ietf-iri-3987bis. Status of this Memo This Internet-Draft is submitted to IETF in full conformance with the provisions of BCP 78 and BCP 79. Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet- Drafts. 
@@ -55,21 +57,21 @@ and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress." The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt. The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html. - This Internet-Draft will expire on September 15, 2011. + This Internet-Draft will expire on September 30, 2011. Copyright Notice Copyright (c) 2011 IETF Trust and the persons identified as the document authors. All rights reserved. This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents @@ -98,21 +100,24 @@ 1.2. Applicability . . . . . . . . . . . . . . . . . . . . . . 6 1.3. Definitions . . . . . . . . . . . . . . . . . . . . . . . 7 1.4. Notation . . . . . . . . . . . . . . . . . . . . . . . . 9 2. IRI Syntax . . . . . . . . . . . . . . . . . . . . . . . . . . 9 2.1. Summary of IRI Syntax . . . . . . . . . . . . . . . . . . 10 2.2. ABNF for IRI References and IRIs . . . . . . . . . . . . 10 3. Processing IRIs and related protocol elements . . . . . . . . 13 3.1. Converting to UCS . . . . . . . . . . . . . . . . . . . . 14 3.2. Parse the IRI into IRI components . . . . . . . . . . . . 14 3.3. General percent-encoding of IRI components . . . . . . . 15 - 3.4. Mapping ireg-name . . . . . . . . . . . . . . . . . . . . 16 + 3.4. Mapping ireg-name . . . . . . . . . . . . . . . . . . . . 15 + 3.4.1. Mapping using Percent-Encoding . . . . . . . . . . . . 15 + 3.4.2. Mapping using Punycode . . . . . . . . . . . . . . . . 16 + 3.4.3. Additional Considerations . . . . . . . . . . . . . . 16 3.5. Mapping query components . . . . . . . . . . . . . . . . 17 3.6. Mapping IRIs to URIs . . . . . . . . . . . . . . 
. . . . 17 3.7. Converting URIs to IRIs . . . . . . . . . . . . . . . . . 17 3.7.1. Examples . . . . . . . . . . . . . . . . . . . . . . . 19 4. Bidirectional IRIs for Right-to-Left Languages . . . . . . . . 20 4.1. Logical Storage and Visual Presentation . . . . . . . . . 21 4.2. Bidi IRI Structure . . . . . . . . . . . . . . . . . . . 22 4.3. Input of Bidi IRIs . . . . . . . . . . . . . . . . . . . 23 4.4. Examples . . . . . . . . . . . . . . . . . . . . . . . . 23 5. Normalization and Comparison . . . . . . . . . . . . . . . . . 25 @@ -122,59 +127,59 @@ 5.3.1. Simple String Comparison . . . . . . . . . . . . . . . 27 5.3.2. Syntax-Based Normalization . . . . . . . . . . . . . . 28 5.3.3. Scheme-Based Normalization . . . . . . . . . . . . . . 31 5.3.4. Protocol-Based Normalization . . . . . . . . . . . . . 32 6. Use of IRIs . . . . . . . . . . . . . . . . . . . . . . . . . 33 6.1. Limitations on UCS Characters Allowed in IRIs . . . . . . 33 6.2. Software Interfaces and Protocols . . . . . . . . . . . . 33 6.3. Format of URIs and IRIs in Documents and Protocols . . . 34 6.4. Use of UTF-8 for Encoding Original Characters . . . . . . 34 6.5. Relative IRI References . . . . . . . . . . . . . . . . . 36 - 7. Liberal handling of otherwise invalid IRIs . . . . . . . . . . 36 - 7.1. LEIRI processing . . . . . . . . . . . . . . . . . . . . 36 - 7.2. Web Address processing . . . . . . . . . . . . . . . . . 37 - 7.3. Characters not allowed in IRIs . . . . . . . . . . . . . 38 + 7. Liberal Handling of Otherwise Invalid IRIs . . . . . . . . . . 36 + 7.1. LEIRI Processing . . . . . . . . . . . . . . . . . . . . 36 + 7.2. Web Address Processing . . . . . . . . . . . . . . . . . 37 + 7.3. Characters Not Allowed in IRIs . . . . . . . . . . . . . 38 8. URI/IRI Processing Guidelines (Informative) . . . . . . . . . 40 8.1. URI/IRI Software Interfaces . . . . . . . . . . . . . . . 40 8.2. URI/IRI Entry . . . . . . . . . . . . . . . . . . . . . . 41 8.3. 
URI/IRI Transfer between Applications . . . . . . . . . . 42 8.4. URI/IRI Generation . . . . . . . . . . . . . . . . . . . 42 8.5. URI/IRI Selection . . . . . . . . . . . . . . . . . . . . 43 8.6. Display of URIs/IRIs . . . . . . . . . . . . . . . . . . 43 8.7. Interpretation of URIs and IRIs . . . . . . . . . . . . . 44 8.8. Upgrading Strategy . . . . . . . . . . . . . . . . . . . 44 9. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 45 10. Security Considerations . . . . . . . . . . . . . . . . . . . 46 11. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 47 - 12. Main Changes Since RFC 3987 . . . . . . . . . . . . . . . . . 48 - 12.1. Major restructuring of IRI processing model . . . . . . . 48 + 12. Main Changes Since RFC 3987 . . . . . . . . . . . . . . . . . 47 + 12.1. Major restructuring of IRI processing model . . . . . . . 47 12.1.1. OLD WAY . . . . . . . . . . . . . . . . . . . . . . . 48 - 12.1.2. NEW WAY . . . . . . . . . . . . . . . . . . . . . . . 49 - 12.1.3. Extension of Syntax . . . . . . . . . . . . . . . . . 49 + 12.1.2. NEW WAY . . . . . . . . . . . . . . . . . . . . . . . 48 + 12.1.3. Extension of Syntax . . . . . . . . . . . . . . . . . 48 12.1.4. More to be added . . . . . . . . . . . . . . . . . . . 49 12.2. Change Log . . . . . . . . . . . . . . . . . . . . . . . 49 12.2.1. Changes after draft-ietf-iri-3987bis-01 . . . . . . . 49 12.2.2. Changes from draft-duerst-iri-bis-07 to draft-ietf-iri-3987bis-00 . . . . . . . . . . . . . . 49 12.2.3. Changes from -06 to -07 of draft-duerst-iri-bis . . . 49 - 12.3. Changes from -00 to -01 . . . . . . . . . . . . . . . . . 50 - 12.4. Changes from -05 to -06 of draft-duerst-iri-bis-00 . . . 50 + 12.3. Changes from -00 to -01 . . . . . . . . . . . . . . . . . 49 + 12.4. Changes from -05 to -06 of draft-duerst-iri-bis-00 . . . 49 12.5. Changes from -04 to -05 of draft-duerst-iri-bis . . . . . 50 12.6. Changes from -03 to -04 of draft-duerst-iri-bis . . . . . 50 12.7. 
Changes from -02 to -03 of draft-duerst-iri-bis . . . . . 50 12.8. Changes from -01 to -02 of draft-duerst-iri-bis . . . . . 50 - 12.9. Changes from -00 to -01 of draft-duerst-iri-bis . . . . . 51 - 12.10. Changes from RFC 3987 to -00 of draft-duerst-iri-bis . . 51 + 12.9. Changes from -00 to -01 of draft-duerst-iri-bis . . . . . 50 + 12.10. Changes from RFC 3987 to -00 of draft-duerst-iri-bis . . 50 13. References . . . . . . . . . . . . . . . . . . . . . . . . . . 51 13.1. Normative References . . . . . . . . . . . . . . . . . . 51 13.2. Informative References . . . . . . . . . . . . . . . . . 52 - Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 55 + Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 54 1. Introduction 1.1. Overview and Motivation A Uniform Resource Identifier (URI) is defined in [RFC3986] as a sequence of characters chosen from a limited subset of the repertoire of US-ASCII [ASCII] characters. The characters in URIs are frequently used for representing words of @@ -444,21 +449,22 @@ The following grammar closely follows the URI grammar in [RFC3986], except that the range of unreserved characters is expanded to include UCS characters, with the restriction that private UCS characters can occur only in query parts. The grammar is split into two parts: Rules that differ from [RFC3986] because of the above-mentioned expansion, and rules that are the same as those in [RFC3986]. For rules that are different than those in [RFC3986], the names of the non-terminals have been changed as follows. If the non-terminal contains 'URI', this has been changed to 'IRI'. Otherwise, an 'i' - has been prefixed. + has been prefixed. The rule has been introduced in order + to be able to reference it from other parts of the document. The following rules are different from those in [RFC3986]: IRI = scheme ":" ihier-part [ "?" 
iquery ] [ "#" ifragment ] ihier-part = "//" iauthority ipath-abempty / ipath-absolute / ipath-rootless / ipath-empty @@ -492,29 +498,28 @@ ipath-absolute = path-sep [ isegment-nz *( path-sep isegment ) ] ipath-noscheme = isegment-nz-nc *( path-sep isegment ) ipath-rootless = isegment-nz *( path-sep isegment ) ipath-empty = 0 path-sep = "/" isegment = *ipchar isegment-nz = 1*ipchar isegment-nz-nc = 1*( iunreserved / pct-form / sub-delims / "@" ) - ; non-zero-length segment without any colon ":" ipchar = iunreserved / pct-form / sub-delims / ":" / "@" iquery = *( ipchar / iprivate / "/" / "?" ) - ifragment = *( ipchar / "/" / "?" / "#" ) + ifragment = *( ipchar / "/" / "?" ) iunreserved = ALPHA / DIGIT / "-" / "." / "_" / "~" / ucschar ucschar = %xA0-D7FF / %xF900-FDCF / %xFDF0-FFEF / %x10000-1FFFD / %x20000-2FFFD / %x30000-3FFFD / %x40000-4FFFD / %x50000-5FFFD / %x60000-6FFFD / %x70000-7FFFD / %x80000-8FFFD / %x90000-9FFFD / %xA0000-AFFFD / %xB0000-BFFFD / %xC0000-CFFFD / %xD0000-DFFFD / %xE1000-EFFFD @@ -579,21 +584,21 @@ which establish the relationship between the string given and the interpreted derivatives. These processing steps apply to both IRIs and IRI references (i.e., absolute or relative forms); for IRIs, some steps are scheme specific. 3.1. Converting to UCS Input that is already in a Unicode form (i.e., a sequence of Unicode characters or an octet-stream representing a Unicode-based character encoding such as UTF-8 or UTF-16) should be left as is and not - normalized (see (see Section 5.3.2.2). + normalized (see Section 5.3.2.2). An IRI or IRI reference is a sequence of characters from the UCS. For IRIs that are not already in a Unicode form (as when written on paper, read aloud, or represented in a text stream using a legacy character encoding), convert the IRI to Unicode. Note that some character encodings or transcriptions can be converted to or represented by more than one sequence of Unicode characters. 
Ideally the resulting IRI would use a normalized form, such as Unicode Normalization Form C [UTR15] (see Section 5.3 Normalization and Comparison), since that ensures a stable, consistent representation @@ -604,97 +609,112 @@ In other cases (written on paper, read aloud, or otherwise represented independent of any character encoding) represent the IRI as a sequence of characters from the UCS normalized according to Unicode Normalization Form C (NFC, [UTR15]). 3.2. Parse the IRI into IRI components Parse the IRI, either as a relative reference (no scheme) or using scheme specific processing (according to the scheme given); the - result resulting in a set of parsed IRI components. (NOTE: FIX - BEFORE RELEASE: INTENT IS THAT ALL IRI SCHEMES THAT USE GENERIC - SYNTAX AND ALLOW NON-ASCII AUTHORITY CAN ONLY USE AUTHORITY FOR NAMES - THAT FOLLOW PUNICODE.) + result is a set of parsed IRI components. - NOTE: The result of parsing into components will correspond result in - a correspondence of subtrings of the IRI according to the part - matched. For example, in [HTML5], the protocol components of - interest are SCHEME (scheme), HOST (ireg-name), PORT (port), the PATH - (ipath after the initial "/"), QUERY (iquery), FRAGMENT (ifragment), - and AUTHORITY (iauthority). + NOTE: The result of parsing into components will correspond to + substrings of the IRI that may be accessible via an API. For example, + in [HTML5], the protocol components of interest are SCHEME (scheme), + HOST (ireg-name), PORT (port), the PATH (ipath after the initial + "/"), QUERY (iquery), FRAGMENT (ifragment), and AUTHORITY + (iauthority). Subsequent processing rules are sometimes used to define other syntactic components. For example, [HTML5] defines APIs for IRI processing; in these APIs: HOSTSPECIFIC the substring that follows the substring matched by the iauthority production, or the whole string if the iauthority production wasn't matched. 
HOSTPORT if there is a scheme component and a port component and the port given by the port component is different than the default port defined for the protocol given by the scheme component, then HOSTPORT is the substring that starts with the substring matched by the host production and ends with the substring matched by the port production, and includes the colon in between the two. Otherwise, it is the same as the host component. 3.3. General percent-encoding of IRI components - For most IRI components, it is possible to map the IRI component to - an equivalent URI component by percent-encoding those characters not - allowed in URIs. Previous processing steps will have removed some - characters, and the interpretation of reserved characters will have - already been done (with the syntactic reserved characters outside of - the IRI component). This mapping is defined for all sequences of - Unicode characters, whether or not they are valid for the component - in question. + Except as noted in the following subsections, IRI components are + mapped to the equivalent URI components by percent-encoding those + characters not allowed in URIs. Previous processing steps will have + removed some characters, and the interpretation of reserved + characters will have already been done (with the syntactic reserved + characters outside of the IRI component). This mapping is defined + for all sequences of Unicode characters, whether or not they are + valid for the component in question. - For each character which is not allowed in a valid URI (NOTE: WHAT IS - THE RIGHT REFERENCE HERE), apply the following steps. + For each character which is not allowed anywhere in a valid URI, + apply the following steps. Convert to UTF-8 Convert the character to a sequence of one or more octets using UTF-8 [RFC3629]. Percent encode Convert each octet of this sequence to %HH, where HH is the hexadecimal notation of the octet value. The hexadecimal notation SHOULD use uppercase letters. 
(This is the general URI percent-encoding mechanism in Section 2.1 of [RFC3986].) Note that the mapping is an identity transformation for parsed URI components of valid URIs, and is idempotent: applying the mapping a second time will not change anything. 3.4. Mapping ireg-name - Schemes that allow non-ASCII based characters in the reg-name (ireg- - name) position MUST convert the ireg-name component of an IRI as - follows: +3.4.1. Mapping using Percent-Encoding + + The ireg-name component SHOULD be converted according to the general + procedure for percent-encoding of IRI components described in + Section 3.3. + + For example, the IRI + "http://résumé.example.org" + will be converted to + "http://r%C3%A9sum%C3%A9.example.org". + + This conversion for ireg-name is in line with Section 3.2.2 of + [RFC3986], which does not mandate a particular registered name lookup + technology. For further background, see [RFC6055] and [Gettys]. + +3.4.2. Mapping using Punycode + + The ireg-name component MAY also be converted as follows: Replace the ireg-name part of the IRI by the part converted using the - ToASCII operation specified in Section 4.1 of [RFC3490] on each dot- - separated label, and by using U+002E (FULL STOP) as a label - separator, with the flag UseSTD3ASCIIRules set to FALSE, and with the - flag AllowUnassigned set to FALSE. The ToASCII operation may fail, - but this would mean that the IRI cannot be resolved. In such cases, - if the domain name conversion fails, then the entire IRI conversion - fails. Processors that have no mechanism for signalling a failure - MAY instead substitute an otherwise invalid host name, although such - processing SHOULD be avoided. + Domain Name Lookup procedure (Subsections 5.3 to 5.5) of [RFC5891] + on each dot-separated label, and by using U+002E (FULL STOP) as a + label separator. This procedure may fail, but this would mean that + the IRI cannot be resolved. 
In such cases, if the domain name + conversion fails, then the entire IRI conversion fails. Processors + that have no mechanism for signalling a failure MAY instead + substitute an otherwise invalid host name, although such processing + SHOULD be avoided. For example, the IRI "http://résumé.example.org" MAY be converted to "http://xn--rsum-bad.example.org" - ; conversion to percent-encoded form, e.g., - "http://r%C3%A9sum%C3%A9.example.org", MUST NOT be performed. + . + + This conversion for ireg-name will be better able to deal with legacy + infrastructure that cannot handle percent-encoding in domain names. + +3.4.3. Additional Considerations Note: Domain Names may appear in parts of an IRI other than the ireg-name part. It is the responsibility of scheme-specific implementations (if the Internationalized Domain Name is part of the scheme syntax) or of server-side implementations (if the Internationalized Domain Name is part of 'iquery') to apply the necessary conversions at the appropriate point. Example: Trying to validate the Web page at http://résumé.example.org would lead to an IRI of http://validator.w3.org/check?uri=http%3A%2F%2Frésumé. @@ -707,23 +727,20 @@ Note: In this process, characters allowed in URI references and existing percent-encoded sequences are not encoded further. (This mapping is similar to, but different from, the encoding applied when arbitrary content is included in some part of a URI.) For example, an IRI of "http://www.example.org/red%09rosé#red" (in XML notation) is converted to "http://www.example.org/red%09ros%C3%A9#red", not to something like "http%3A%2F%2Fwww.example.org%2Fred%2509ros%C3%A9%23red". - ((DESIGN QUESTION: What about e.g. - http://r%C3%A9sum%C3%A9.example.org in an IRI? Will that get - converted to punycode, or not?)) 3.5. 
Mapping query components ((NOTE: SEE ISSUES LIST)) For compatibility with existing deployed HTTP infrastructure, the following special case applies for schemes "http" and "https" and IRIs whose origin has a document charset other than one which is UCS-based (e.g., UTF-8 or UTF-16). In such a case, the "query" component of an IRI is mapped into a URI by using the document charset rather than UTF-8 as the binary representation before pct-encoding. This mapping is not applied for any other @@ -1346,21 +1365,21 @@ to the uppercase/lowercase problems. Some parts of a URI are case insensitive (for example, the domain name). For others, it is unclear whether they are case sensitive, case insensitive, or something in between (e.g., case sensitive, but with a multiple choice selection if the wrong case is used, instead of a direct negative result). The best recipe is that the creator use a reasonable capitalization and, when transferring the URI, capitalization never be changed. Various IRI schemes may allow the usage of Internationalized Domain - Names (IDN) [RFC3490] either in the ireg-name part or elsewhere. + Names (IDN) [RFC5890] either in the ireg-name part or elsewhere. Character Normalization also applies to IDNs, as discussed in Section 5.3.3. 5.3.2.3. Percent-Encoding Normalization The percent-encoding mechanism (Section 2.1 of [RFC3986]) is a frequent source of variance among otherwise identical IRIs. In addition to the case normalization issue noted above, some IRI producers percent-encode octets that do not require percent-encoding, resulting in IRIs that are equivalent to their nonencoded @@ -1620,57 +1639,57 @@ used if the query part is encoded in UTF-8. 6.5. Relative IRI References Processing of relative IRI references against a base is handled straightforwardly; the algorithms of [RFC3986] can be applied directly, treating the characters additionally allowed in IRI references in the same way that unreserved characters are in URI references. -7. 
Liberal handling of otherwise invalid IRIs +7. Liberal Handling of Otherwise Invalid IRIs (EDITOR NOTE: This Section may move to an appendix.) Some technical specifications and widely-deployed software have allowed additional variations and extensions of IRIs to be used in syntactic components. This section describes two widely-used preprocessing agreements. Other technical specifications may wish to reference a syntactic component which is "a valid IRI or a string that will map to a valid IRI after this preprocessing algorithm". These two variants are known as Legacy Extended IRI or LEIRI [LEIRI], and Web Address [HTML5]. Future technical specifications SHOULD NOT allow conforming producers to produce, or conforming content to contain, such forms, as they are not interoperable with other IRI consuming software. -7.1. LEIRI processing +7.1. LEIRI Processing This section defines Legacy Extended IRIs (LEIRIs). The syntax of Legacy Extended IRIs is the same as that for , except that the ucschar production is replaced by the leiri-ucschar production: leiri-ucschar = " " / "<" / ">" / '"' / "{" / "}" / "|" / "\" / "^" / "`" / %x0-1F / %x7F-D7FF / %xE000-FFFD / %x10000-10FFFF Among other extensions, processors based on this specification also did not enforce the restriction on bidirectional formatting characters in Section 4.1, and the iprivate production becomes redundant. To convert a string allowed as a LEIRI to an IRI, each character allowed in leiri-ucschar but not in ucschar must be percent-encoded using Section 3.3. -7.2. Web Address processing +7.2. Web Address Processing Many popular web browsers have taken the approach of being quite liberal in what is accepted as a "URL" or its relative forms. This section describes their behavior in terms of a preprocessor which maps strings into the IRI space for subsequent parsing and interpretation as an IRI. 
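As an illustration only, one of the repairs that this preprocessor performs (listed among the preprocessing steps later in this section: replacing a "%" that is not followed by two hexadecimal digits by "%25") can be sketched in Python. The function name escape_stray_percent is hypothetical and is not defined by this document or by [HTML5]; the other preprocessing steps are deliberately not covered.

```python
import re

# A "%" that does not begin a pct-encoded triplet ("%" HEXDIG HEXDIG).
# The negative lookahead leaves well-formed triplets such as "%2F" alone.
_STRAY_PERCENT = re.compile(r"%(?![0-9A-Fa-f]{2})")


def escape_stray_percent(text: str) -> str:
    """Replace each "%" not followed by two hex digits with "%25".

    Hypothetical helper for illustration; it models a single
    preprocessing step and nothing else.
    """
    return _STRAY_PERCENT.sub("%25", text)


if __name__ == "__main__":
    print(escape_stray_percent("http://example.org/100%/a%2Fb"))
```

Because "%25" is itself a well-formed triplet, running the sketch a second time on its own output changes nothing.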
In some situations, it might be appropriate to describe the syntax that a liberal consumer implementation might accept as a "Web Address" or "Hypertext Reference" or "HREF". However, technical @@ -1702,29 +1721,30 @@ has a Document, and the HRef-charset is the Document's character encoding. If the string had a HRef-charset defined when the string was created or defined The HRef-charset is as defined. If the resulting HRef-charset is a Unicode-based character encoding (e.g., UTF-16), then use UTF-8 instead. The syntax for Web Addresses is obtained by replacing the 'ucschar', - pct-form, and path-sep rules with the href-ucschar, href-pct-form, - and href-path-sep rules below. In addition, some characters are - stripped. + pct-form, path-sep, and ifragment rules with the href-ucschar, href- + pct-form, href-path-sep, and href-ifragment rules below. In + addition, some characters are stripped. href-ucschar = " " / "<" / ">" / DQUOTE / "{" / "}" / "|" / "\" / "^" / "`" / %x0-1F / %x7F-D7FF / %xE000-FFFD / %x10000-10FFFF href-pct-form = pct-encoded / "%" href-path-sep = "/" / "\" + href-ifragment = *( ipchar / "/" / "?" / "#" ) ; adding "#" href-strip = (NOTE: NEED TO FIX THESE SETS TO MATCH HTML5; NOT SURE ABOUT NEXT SENTENCE) browsers did not enforce the restriction on bidirectional formatting characters in Section 4.1, and the iprivate production becomes redundant. 'Web Address processing' requires the following additional preprocessing steps: @@ -1734,21 +1754,21 @@ 2. Strip all characters in href-strip. 3. Percent-encode all characters in href-ucschar not in ucschar. 4. Replace occurrences of "%" not followed by two hexadecimal digits by "%25". 5. Convert backslashes ('\') matching href-path-sep to forward slashes ('/'). -7.3. Characters not allowed in IRIs +7.3. 
Characters Not Allowed in IRIs This section provides a list of the groups of characters and code points that are allowed by LEIRI or HREF but are not allowed in IRIs or are allowed in IRIs only in the query part. For each group of characters, advice on the usage of these characters is also given, concentrating on the reasons for why they are excluded from IRI use. Space (U+0020): Some formats and applications use space as a delimiter, e.g. for items in a list. Appendix C of [RFC3986] also mentions that white space may have to be added when displaying or @@ -2094,84 +2114,61 @@ indicate their usability as IRI schemes. Update "per RFC 4395" to "per RFC 4395 and RFC XXXX". 10. Security Considerations The security considerations discussed in [RFC3986] also apply to IRIs. In addition, the following issues require particular care for IRIs. - Incorrect encoding or decoding can lead to security problems. In - particular, some UTF-8 decoders do not check against overlong byte - sequences. As an example, a "/" is encoded with the byte 0x2F both - in UTF-8 and in US-ASCII, but some UTF-8 decoders also wrongly - interpret the sequence 0xC0 0xAF as a "/". A sequence such as - "%C0%AF.." may pass some security tests and then be interpreted as - "/.." in a path if UTF-8 decoders are fault-tolerant, if conversion - and checking are not done in the right order, and/or if reserved - characters and unreserved characters are not clearly distinguished. - - There are various ways in which "spoofing" can occur with IRIs. - "Spoofing" means that somebody may add a resource name that looks the - same or similar to the user, but that points to a different resource. - The added resource may pretend to be the real resource by looking - very similar but may contain all kinds of changes that may be - difficult to spot and that can cause all kinds of problems. Most - spoofing possibilities for IRIs are extensions of those for URIs. - - Spoofing can occur for various reasons. 
First, a user's - normalization expectations or actual normalization when entering an - IRI or transcoding an IRI from a legacy character encoding do not - match the normalization used on the server side. Conceptually, this - is no different from the problems surrounding the use of case- - insensitive web servers. For example, a popular web page with a - mixed-case name ("http://big.example.com/PopularPage.html") might be - "spoofed" by someone who is able to create - "http://big.example.com/popularpage.html". However, the use of - unnormalized character sequences, and of additional mappings for user - convenience, may increase the chance for spoofing. Protocols and - servers that allow the creation of resources with names that are not - normalized are particularly vulnerable to such attacks. This is an - inherent security problem of the relevant protocol, server, or - resource and is not specific to IRIs, but it is mentioned here for - completeness. - - Spoofing can occur in various IRI components, such as the domain name - part or a path part. For considerations specific to the domain name - part, see [RFC3491]. For the path part, administrators of sites that - allow independent users to create resources in the same sub area may - have to be careful to check for spoofing. + Incorrect encoding or decoding can lead to security problems. For + example, some UTF-8 decoders do not check against overlong byte + sequences. See [UTR36] Section 3 for details. - Spoofing can occur because in the UCS many characters look very - similar. Details are discussed in Section 8.5. Again, this is very - similar to spoofing possibilities on US-ASCII, e.g., using "br0ken" - or "1ame" URIs. + There are serious difficulties with relying on a human to verify that + an IRI (whether presented visually or aurally) is the same as + another IRI or is the one intended. These problems exist with ASCII- + only URIs (bl00mberg.com vs. 
bloomberg.com) but are strongly + exacerbated when using the much larger character repertoire of + Unicode. For details, see Section 2 of [UTR36]. Using + administrative and technical means to reduce the availability of such + exploits is possible, but they are difficult to eliminate altogether. + User agents SHOULD NOT rely on visual or perceptual comparison or + verification of IRIs as a means of validating or assuring safety, + correctness or appropriateness of an IRI. Other means of presenting + users with the validity, safety, or appropriateness of visited sites + are being developed in the browser community as an alternative means + of avoiding these difficulties. - Spoofing can occur when URIs with percent-encodings based on various - character encodings are accepted to deal with older user agents. In - some cases, particularly for Latin-based resource names, this is - usually easy to detect because UTF-8-encoded names, when interpreted - and viewed as legacy character encodings, produce mostly garbage. + Besides the large character repertoire of Unicode, reasons for + confusion include different forms of normalization and different + normalization expectations, use of percent-encoding with various + legacy encodings, and bidirectionality issues. See also [UTR36]. - When concurrently used character encodings have a similar structure - but there are no characters that have exactly the same encoding, - detection is more difficult. + Confusion can occur in various IRI components, such as the domain + name part or the path part, or between IRI components. For + considerations specific to the domain name part, see [RFC5890]. For + considerations specific to particular protocols or schemes, see the + security sections of the relevant specifications and registration + templates. Administrators of sites that allow independent users to + create resources in the same sub area have to be careful. Details + are discussed in Section 8.5. 
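The decoding concern raised at the start of this section can be illustrated with a short Python sketch. It assumes a hypothetical helper, decode_percent_encoded_utf8, that percent-decodes a component and then decodes the resulting octets as UTF-8 with strict error handling, so that ill-formed input such as the overlong sequence "%C0%AF" is rejected instead of being read as "/". This is one possible check under those assumptions, not a procedure defined by this document.

```python
from typing import Optional
from urllib.parse import unquote_to_bytes


def decode_percent_encoded_utf8(component: str) -> Optional[str]:
    """Percent-decode a component and decode the octets as strict UTF-8.

    Hypothetical helper for illustration. Returns None when the octets
    are not well-formed UTF-8 (for example, an overlong encoding), so
    the caller can refuse the input before any path or security checks
    are applied to a wrongly decoded string.
    """
    octets = unquote_to_bytes(component)
    try:
        # Python's built-in UTF-8 codec rejects overlong sequences
        # such as 0xC0 0xAF when error handling is "strict".
        return octets.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return None


if __name__ == "__main__":
    for sample in ("r%C3%A9sum%C3%A9", "%C0%AF.."):
        print(sample, "->", decode_percent_encoded_utf8(sample))
```

Returning None rather than substituting a replacement character keeps the rejection explicit; a caller that silently repaired the input could still end up comparing or routing on a string the sender never wrote.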
- Spoofing can occur with bidirectional IRIs, if the restrictions in + Confusion can occur with bidirectional IRIs, if the restrictions in Section 4.2 are not followed. The same visual representation may be interpreted as different logical representations, and vice versa. It is also very important that a correct Unicode bidirectional implementation be used. - The use of Legacy Extended IRIs introduces additional security - issues. + The characters additionally allowed in Legacy Extended IRIs introduce + additional security issues. For details, see Section 7.3. 11. Acknowledgements This document was derived from [RFC3987]; the acknowledgments from that specification still apply. We would like to thank Ian Hickson, Michael Sperberg-McQueen, and Dan Connolly for their work on HyperText References, and Norman Walsh, Richard Tobin, Henry S. Thomson, John Cowan, Paul Grosso, and the XML Core Working Group of the W3C for their work on LEIRIs. @@ -2179,21 +2176,21 @@ In addition, this document was influenced by contributions from (in no particular order) Chris Lilley, Bjoern Hoehrmann, Felix Sasaki, Jeremy Carroll, Frank Ellermann, Michael Everson, Cary Karp, Matitiahu Allouche, Richard Ishida, Addison Phillips, Jonathan Rosenne, Najib Tounsi, Debbie Garside, Mark Davis, Sarmad Hussain, Ted Hardie, Konrad Lanz, Thomas Roessler, Lisa Dusseault, Julian Reschke, Giovanni Campagna, Anne van Kesteren, Mark Nottingham, Erik van der Poel, Marcin Hanclik, Marcos Caceres, Roy Fielding, Greg Wilkins, Pieter Hintjens, Daniel R. Tobias, Marko Martin, Maciej Stanchowiak, Wil Tan, Yui Naruse, Michael A. Puls II, Dave Thaler, - Tom Perch, John Klensin, Shawn Steele, Peter Saint-Andre, Geoffrey + Tom Petch, John Klensin, Shawn Steele, Peter Saint-Andre, Geoffrey Sneddon, Chris Weber, Alex Melnikov, Slim Amamou, SM, Tim Berners- Lee, Yaron Goland, Sam Ruby, Adam Barth, Abdulrahman I. ALGhadir, Aharon Lanin, Thomas Milo, Murray Sargent, Marc Blanchet, and Mykyta Yevstifeyev. 12. 
Main Changes Since RFC 3987 This section describes the main changes since [RFC3987]. 12.1. Major restructuring of IRI processing model @@ -2473,20 +2470,24 @@ [RFC3987] Duerst, M. and M. Suignard, "Internationalized Resource Identifiers (IRIs)", RFC 3987, January 2005. [RFC4395bis] Hansen, T., Hardie, T., and L. Masinter, "Guidelines and Registration Procedures for New URI/IRI Schemes", draft-hansen-iri-4395bis-irireg-00 (work in progress), September 2010. + [RFC6055] Thaler, D., Klensin, J., and S. Cheshire, "IAB Thoughts on + Encodings for Internationalized Domain Names", RFC 6055, + February 2011. + [UNIXML] Duerst, M. and A. Freytag, "Unicode in XML and other Markup Languages", Unicode Technical Report #20, World Wide Web Consortium Note, June 2003, . [UTR36] Davis, M. and M. Suignard, "Unicode Security Considerations", Unicode Technical Report #36, August 2010, . [XLink] DeRose, S., Maler, E., and D. Orchard, "XML Linking