--- 1/draft-ietf-iri-comparison-01.txt 2012-10-23 02:14:36.673342482 +0200 +++ 2/draft-ietf-iri-comparison-02.txt 2012-10-23 02:14:36.705342235 +0200 @@ -1,51 +1,54 @@ Internationalized Resource Identifiers L. Masinter (iri) Adobe Internet-Draft M. Duerst -Intended status: Standards Track Aoyama Gakuin University -Expires: September 3, 2012 March 2, 2012 +Updates: 3986 (if approved) Aoyama Gakuin University +Intended status: Standards Track October 23, 2012 +Expires: April 26, 2013 - Equivalence and Canonicalization of Internationalized Resource - Identifiers (IRIs) - draft-ietf-iri-comparison-01 + Comparison, Equivalence and Canonicalization of Internationalized + Resource Identifiers + draft-ietf-iri-comparison-02 Abstract - Internationalized Resource Identifiers (IRIs) are unicode strings + Internationalized Resource Identifiers (IRIs) are Unicode strings used to identify resources on the Internet. Applications that use - IRIs often define a means of comparing two IRIs to determine when two + IRIs often define a means of comparing IRIs to determine when two IRIs are equivalent for the purpose of that application. Some - applications also define a method for 'canonicalizing' or - 'normalizing' an IRI -- translating one IRI into another which is - equivalent under the comparison method used. + applications also define a method for canonicalizing an IRI -- + translating one IRI into another which is equivalent under the + comparison method used. This document gives guidelines and best practices for defining and - using IRI comparison, equivalence, normalization and canonicalization - methods. + using IRI comparison and canonicalization methods. + + Comparison methods are used to determine equivalence. As URIs are a + subset of IRIs, the guidelines apply to URI comparison as well. Status of this Memo This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79. 
Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet- Drafts is at http://datatracker.ietf.org/drafts/current/. Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress." - This Internet-Draft will expire on September 3, 2012. + This Internet-Draft will expire on April 26, 2013. Copyright Notice Copyright (c) 2012 IETF Trust and the persons identified as the document authors. All rights reserved. This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents @@ -63,180 +66,181 @@ Without obtaining an adequate license from the person(s) controlling the copyright in such materials, this document may not be modified outside the IETF Standards Process, and derivative works of it may not be created outside the IETF Standards Process, except to format it for publication as an RFC or to translate it into languages other than English. Table of Contents 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 4 - 2. Equivalence . . . . . . . . . . . . . . . . . . . . . . . . . 4 - 3. Comparison, Equivalence, Normalization and - Canonicalization . . . . . . . . . . . . . . . . . . . . . . . 5 - 4. Preparation for Comparison . . . . . . . . . . . . . . . . . . 5 - 5. Comparison Ladder . . . . . . . . . . . . . . . . . . . . . . 6 - 5.1. Simple String Comparison . . . . . . . . . . . . . . . . . 6 - 5.2. Syntax-Based Equivalence . . . . . . . . . . . . . . . . . 7 - 5.2.1. Case Equivalence . . . . . . . . . . . . . . . . . . . 8 - 5.2.2. Unicode Character Normalization . . 
. . . . . . . . . 8 - 5.2.3. Percent-Encoding Equivalence . . . . . . . . . . . . . 9 - 5.2.4. Path Segment Equivalence . . . . . . . . . . . . . . . 10 - 5.3. Scheme-Based Comparison . . . . . . . . . . . . . . . . . 10 - 5.4. Protocol-Based Comparison . . . . . . . . . . . . . . . . 11 - 6. Security Considerations . . . . . . . . . . . . . . . . . . . 12 - 7. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 12 - 8. References . . . . . . . . . . . . . . . . . . . . . . . . . . 12 - 8.1. Normative References . . . . . . . . . . . . . . . . . . . 12 - 8.2. Informative References . . . . . . . . . . . . . . . . . . 13 + 2. General guidelines . . . . . . . . . . . . . . . . . . . . . . 4 + 3. Preparation for Comparison . . . . . . . . . . . . . . . . . . 5 + 4. Comparison Hierarchy . . . . . . . . . . . . . . . . . . . . . 6 + 4.1. Simple String Comparison . . . . . . . . . . . . . . . . . 6 + 4.2. Syntax-Based Equivalence . . . . . . . . . . . . . . . . . 7 + 4.2.1. Case Equivalence . . . . . . . . . . . . . . . . . . . 8 + 4.2.2. Unicode Character Normalization . . . . . . . . . . . 8 + 4.2.3. Percent-Encoding Equivalence . . . . . . . . . . . . . 9 + 4.2.4. Path Segment Equivalence . . . . . . . . . . . . . . . 10 + 4.3. Scheme-Based Comparison . . . . . . . . . . . . . . . . . 10 + 4.4. Protocol-Based Comparison . . . . . . . . . . . . . . . . 11 + 5. Security Considerations . . . . . . . . . . . . . . . . . . . 12 + 6. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 12 + 7. References . . . . . . . . . . . . . . . . . . . . . . . . . . 12 + 7.1. Normative References . . . . . . . . . . . . . . . . . . . 12 + 7.2. Informative References . . . . . . . . . . . . . . . . . . 13 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 14 1. 
Introduction - Internationalized Resource Identifiers (IRIs) are unicode strings + Internationalized Resource Identifiers (IRIs) are Unicode strings used to identify resources on the Internet. Applications that use - IRIs often define a means of comparing two IRIs to determine when two + IRIs often define a means of comparing IRIs to determine when two IRIs are equivalent for the purpose of that application. Some - applications also define a method for 'canonicalizing' or - 'normalizing' an IRI -- translating one IRI into another which is - equivalent under the comparison method used. + applications also define a method for canonicalizing an IRI -- + translating one IRI into another which is equivalent under the + comparison method used. This document gives guidelines and best practices for defining and - using IRI comparison, equivalence, normalization and canonicalization - methods. + using IRI comparison and canonicalization methods. - One of the most common operations on IRIs is simple comparison: - Determining whether two IRIs are equivalent, without using the IRIs - to access their respective resource(s). A comparison is performed - whenever a response cache is accessed, a browser checks its history - to color a link, or an XML parser processes tags within a namespace. - Extensive normalization prior to comparison of IRIs may be used by - spiders and indexing engines to prune a search space or reduce - duplication of request actions and response storage. + As every URI is also an IRI, the comparison and canonicalization + methods also apply to URIs. + + IRI comparison is expected to determine whether two IRIs are + equivalent without using the IRIs to access their respective + resource(s). For example, comparisons are performed whenever a + response cache is accessed, a browser checks its history to color a + link, or an XML parser processes tags within a namespace. 
+ + Comparison for equivalence is often accomplished by canonicalization: + (sometimes called normalization): a process for converting data that + has more than one possible representation into a "standard", + "normal", or "canonical" form. Extensive canonicalization prior to + comparison of IRIs may be used by spiders and indexing engines to + prune a search space or reduce duplication of request actions and + response storage. IRI comparison is performed for some particular purpose. Protocols or implementations that compare IRIs for different purposes will often be subject to differing design trade-offs in regards to how much effort should be spent in reducing aliased identifiers. This document describes various methods that may be used to compare IRIs, the trade-offs between them, and the types of applications that might use them. -2. Equivalence +2. General guidelines - Because IRIs exist to identify resources, presumably they should be - considered equivalent when they identify the same resource. However, - this definition of equivalence is not of much practical use, as there - is no way for an implementation to compare two resources to determine - if they are "the same" unless it has full knowledge or control of - them. For this reason, determination of equivalence or difference of - IRIs is based on string comparison, augmented by reference to - additional rules provided by scheme definition. We use the terms - "different" and "equivalent" to describe the possible outcomes of - such comparisons, but there are many application-dependent versions - of equivalence. + Because IRIs exist to identify resources, one might expect two IRIs + to be considered equivalent when they identify the same resource. + However, this definition of equivalence is not of much practical use, + as there is in general no way for an implementation to compare two + resources to determine if they are "the same" unless it has full + knowledge or control of them. 
Comparison methods for IRIs are + generally based strictly on examining the characters that make up the + IRI, without performing any network access. + + We use the terms "different" and "equivalent" to describe the + possible outcomes of such comparisons, but there are many + application-dependent versions of equivalence. Even when it is possible to determine that two IRIs are equivalent, IRI comparison is not sufficient to determine whether two IRIs identify different resources. For example, an owner of two different domain names could decide to serve the same resource from both, - resulting in two different IRIs. Therefore, comparison methods are - designed to minimize false negatives while strictly avoiding false - positives. - - In testing for equivalence, applications should not directly compare - relative references; the references should be converted to their - respective target IRIs before comparison. When IRIs are compared to - select (or avoid) a network action, such as retrieval of a - representation, fragment components (if any) MUST be excluded from - the comparison. - - Applications using IRIs as identity tokens with no relationship to a - protocol MUST use the Simple String Comparison (see Section 5.1). - All other applications MUST select one of the comparison practices - from the Comparison Ladder (see Section 5. + resulting in two different IRIs. For this reason, false negatives + (e.g., returning "different" even with the resources are "the same") + cannot be completely avoided. Comparison methods often try to + minimize false negatives while strictly avoiding false positives. + However, in some cases (such as cache invalidation), false negatives + are more harmful than false positives. -3. Comparison, Equivalence, Normalization and Canonicalization + A comparison method for determining equivalence might have multiple + values, for example, returning "equivalent", "different", or + "equivalence cannot be determined". 
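As a minimal sketch of the multi-valued result described above, the following Python fragment returns one of three outcomes. The names Outcome and compare_iris are hypothetical and do not come from the draft. The sketch assumes that a reference with no scheme is a relative reference and cannot be judged until it is resolved against a base, and that everything else is compared as a plain string of Unicode characters, so some IRIs reported as "different" may still identify the same resource.

```python
import re
from enum import Enum

# Assumption: a string that starts with "scheme:" is treated as an absolute IRI;
# anything else is treated as a relative reference.
_SCHEME = re.compile(r"^[A-Za-z][A-Za-z0-9+.-]*:")


class Outcome(Enum):
    """Illustrative three-valued comparison result (hypothetical names)."""
    EQUIVALENT = "equivalent"
    DIFFERENT = "different"
    UNDETERMINED = "equivalence cannot be determined"


def compare_iris(a: str, b: str) -> Outcome:
    # Relative references depend on a base that is not known here,
    # so no verdict is given for them.
    if not (_SCHEME.match(a) and _SCHEME.match(b)):
        return Outcome.UNDETERMINED
    # Python compares str values code point by code point, with no
    # network access and no scheme-specific knowledge.
    return Outcome.EQUIVALENT if a == b else Outcome.DIFFERENT
```

Under these assumptions, comparing "http://example.com/" with "http://example.com" yields DIFFERENT, which is a false negative that a scheme-aware comparison could remove.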
- In general, when considering a set of items or strings, there are - several interrelated concepts. A comparison method determines, - between two items in the set, their relationship. In particular, a - comparison method for determining equivalence might result in a - determination that two (different) items are equivalent, known to be - different, or that equivalence isn't determined. + Multiple canonicalization (normalizations) methods might be defined, + where sequential application of each results in greater sets of + equivalent values. - One way to define a comparison for equivalence is to define a a - normalization or canonicalization algorithm. For each item in a set - of equivalent items, one of them could be designated the "normal" or - "canonical" form. + In testing for equivalence, applications should not directly compare + relative references; the references should be converted to their + respective target IRIs before comparison. [[ref 3987bis]] - These general concepts are used with IRIs in this document, and in - other circumstances, where a mapping from one sequence of Unicode - characters to another one could be described as a "normalization" - algorithm. + Some IRIs contain fragment identifiers. In general, the equivalence + of two IRIs is determined first by comparing the IRIs without any + fragment identifiers, and then (if appropriate) the fragment + components (if any) compared. - In general, this document tries to stay with the "equivalence" or - "comparison" methods, become some times the mathematical notion of - "normalization" results in forms that ordinary users might not - consider "normal" in an ordinary sense. + Some applications (such as XML namespaces) use IRIs as identity + tokens without any relationship to acessing the resources. Those + applications use the Simple String Comparison (see Section 4.1). -4. Preparation for Comparison +3. 
Preparation for Comparison Any kind of IRI comparison REQUIRES that any additional contextual processing is first performed, including undoing higher-level escapings or encodings in the protocol or format that carries an IRI. This preprocessing is usually done when the protocol or format is parsed. + NOTE: This document has not yet been updated to use in-line Unicode + examples. + Examples of such escapings or encodings are entities and numeric character references in [HTML4] and [XML1]. As an example, "http://example.org/ros&eacute;" (in HTML), "http://example.org/ros&#233;" (in HTML or XML), and "http://example.org/ros&#xE9;" (in HTML or XML) are all resolved into what is denoted in this document (see 'Notation' section of [RFC3987bis]) as "http://example.org/ros&#xE9;" (the "&#xE9;" here standing for the actual e-acute character, to compensate for the fact that this document cannot contain non-ASCII characters). - Similar considerations apply to encodings such as Transfer Codings in - HTTP (see [RFC2616]) and Content Transfer Encodings in MIME - ([RFC2045]), although in these cases, the encoding is based not on - characters but on octets, and additional care is required to make - sure that characters, and not just arbitrary octets, are compared - (see Section 5.1). + An IRI is a sequence of Unicode characters. IRIs are sometimes + represented in documents as sequences of bytes in a charset, either + Unicode-based (UTF-8) or using some other character encoding (e.g., + ISO-8859-1). Before comparing two such sequences, they must both be + converted into sequences of Unicode characters. -5. Comparison Ladder + Similarly, encodings such as Transfer Codings in HTTP (see [RFC2616]) + and Content Transfer Encodings in MIME ([RFC2045]) must be unencoded. + In these cases, the encoding is based not on characters but on + octets, and additional care is required to make sure that characters, + and not just arbitrary octets, are compared (see Section 4.1. + +4. 
Comparison Hierarchy In practice, a variety of methods are used to test IRI equivalence. - These methods fall into a range distinguished by the amount of - processing required and the degree to which the probability of false - negatives is reduced. As noted above, false negatives cannot be - eliminated. In practice, their probability can be reduced, but this - reduction requires more processing and is not cost-effective for all - applications. + These methods generally fall into a range distinguished by the amount + of processing required and the degree to which the probability of + false negatives is reduced. As noted above, false negatives cannot + be eliminated. In practice, their probability can be reduced, but + this reduction requires more processing and is not cost-effective for + all applications. - If this range of comparison practices is considered as a ladder, the - following discussion will climb the ladder, starting with practices - that are cheap but have a relatively higher chance of producing false + The following discussion starts with comparison methods that are + cheap but have a relatively higher chance of producing false negatives, and proceeding to those that have higher computational cost and lower risk of false negatives. -5.1. Simple String Comparison +4.1. Simple String Comparison - If two IRIs, when considered as character strings, are identical, - then it is safe to conclude that they are equivalent. This type of - equivalence test has very low computational cost and is in wide use - in a variety of applications, particularly in the domain of parsing. - It is also used when a definitive answer to the question of IRI - equivalence is needed that is independent of the scheme used and that - can be calculated quickly and without accessing a network. An - example of such a case is XML Namespaces ([XMLNamespace]). 
+ If two IRIs (when considered as strings of Unicode characters) are + identical, then it is safe to conclude that they are equivalent. + This type of equivalence test has very low computational cost and is + in wide use in a variety of applications, particularly in the domain + of parsing. It is also used when a definitive answer to the question + of IRI equivalence is needed that is independent of the scheme used + and that can be calculated quickly and without accessing a network. + An example of such a case is XML Namespaces ([XMLNamespace]). Testing strings for equivalence requires some basic precautions. This procedure is often referred to as "bit-for-bit" or "byte-for- byte" comparison, which is potentially misleading. Testing strings for equality is normally based on pair comparison of the characters that make up the strings, starting from the first and proceeding until both strings are exhausted and all characters are found to be equal, until a pair of characters compares unequal, or until one of the strings is exhausted before the other. @@ -250,71 +254,71 @@ codepoint by codepoint after conversion to a common character encoding form. When comparing character by character, the comparison function MUST NOT map IRIs to URIs, because such a mapping would create additional spurious equivalences. It follows that an IRI SHOULD NOT be modified when being transported if there is any chance that this IRI might be used in a context that uses Simple String Comparison. False negatives are caused by the production and use of IRI aliases. Unnecessary aliases can be reduced, regardless of the comparison - method, by consistently providing IRI references in an already - normalized form (i.e., a form identical to what would be produced - after normalization is applied, as described below). 
Protocols and - data formats often limit some IRI comparisons to simple string - comparison, based on the theory that people and implementations will, - in their own best interest, be consistent in providing IRI - references, or at least be consistent enough to negate any efficiency - that might be obtained from further normalization. + method, by consistently providing IRI references in a canonical form + (after canonicalization is applied). -5.2. Syntax-Based Equivalence + Protocols and data formats might limit some IRI comparisons to simple + string comparison, based on the theory that people and + implementations will, in their own best interest, be consistent in + providing IRI references, or at least be consistent enough to negate + any efficiency that might be obtained from further canonicalization. + +4.2. Syntax-Based Equivalence Implementations may use logic based on the definitions provided by this specification to reduce the probability of false negatives. This processing is moderately higher in cost than character-for- character string comparison. For example, an application using this approach could reasonably consider the following two IRIs equivalent: example://a/b/c/%7Bfoo%7D/rosé eXAMPLE://a/./b/../b/%63/%7bfoo%7d/ros%C3%A9 Web user agents, such as browsers, typically apply this type of IRI - normalization when determining whether a cached response is - available. Syntax-based normalization includes such techniques as - case normalization, character normalization, percent-encoding - normalization, and removal of dot-segments. + equivalence when determining whether a cached response is available. + Syntax-based equivalence includes such techniques as case + equivalence, Unicode character normalization, percent-encoding + equivalence, and removal of dot-segments. -5.2.1. Case Equivalence +4.2.1. 
Case Equivalence For all IRIs, the hexadecimal digits within a percent-encoding triplet (e.g., "%3a" versus "%3A") are case-insensitive and therefore should be considered equivalent to forms which use uppercase letters for the digits A-F. When an IRI uses components of the generic syntax, the component syntax equivalence rules always apply; namely, that the scheme and US-ASCII only host are case insensitive and therefore should be - normalized to lowercase. For example, the URI + treated equivalent to lowercase. For example, the URI "HTTP://www.EXAMPLE.com/" is equivalent to "http://www.example.com/". Case equivalence for non-ASCII characters in IRI components that are - IDNs are discussed in Section 5.3. The other generic syntax + IDNs are discussed in Section 4.3. The other generic syntax components are assumed to be case sensitive unless specifically defined otherwise by the scheme. Creating schemes that allow case-insensitive syntax components - containing non-ASCII characters should be avoided. Case - normalization of non-ASCII characters can be culturally dependent and - is always a complex operation. The only exception concerns non-ASCII - host names for which the character normalization includes a mapping - step derived from case folding. + containing non-ASCII characters should be avoided. Case equivalence + of non-ASCII characters can be culturally dependent and is always a + complex operation. The only exception concerns non-ASCII host names + for which the character normalization includes a mapping step derived + from case folding. -5.2.2. Unicode Character Normalization +4.2.2. Unicode Character Normalization The Unicode Standard [UNIV6] defines various equivalences between sequences of characters for various purposes. 
Unicode Standard Annex #15 [UTR15] defines various Normalization Forms for these equivalences, in particular Normalization Form C (NFC, Canonical Decomposition, followed by Canonical Composition) and Normalization Form KC (NFKC, Compatibility Decomposition, followed by Canonical Composition). IRIs already in Unicode MUST NOT be normalized before parsing or @@ -358,158 +362,152 @@ unclear whether they are case sensitive, case insensitive, or something in between (e.g., case sensitive, but with a multiple choice selection if the wrong case is used, instead of a direct negative result). The best recipe is that the creator use a reasonable capitalization and, when transferring the URI, capitalization never be changed. Various IRI schemes may allow the usage of Internationalized Domain Names (IDN) [RFC5890] either in the ireg-name part or elsewhere. Character Normalization also applies to IDNs, as discussed in - Section 5.3. + Section 4.3. -5.2.3. Percent-Encoding Equivalence +4.2.3. Percent-Encoding Equivalence The percent-encoding mechanism (Section 2.1 of [RFC3986]) is a frequent source of variance among otherwise identical IRIs. In - addition to the case normalization issue noted above, some IRI + addition to the case equivalence issue noted above, some IRI producers percent-encode octets that do not require percent-encoding, resulting in IRIs that are equivalent to their nonencoded - counterparts. These IRIs should be normalized by decoding any + counterparts. These IRIs should be compared by first decoding any percent-encoded octet sequence that corresponds to an unreserved character, as described in section 2.3 of [RFC3986]. For actual resolution, differences in percent-encoding (except for - the percent-encoding of reserved characters) MUST always result in + the percent-encoding of reserved characters) SHOULD always result in the same resource. 
For example, "http://example.org/~user", - "http://example.org/%7euser", and "http://example.org/%7Euser", must - resolve to the same resource. + "http://example.org/%7euser", and "http://example.org/%7Euser", + SHOULD resolve to the same resource. If this kind of equivalence is to be tested, the percent-encoding of - both IRIs to be compared has to be aligned; for example, by + both IRIs to be compared first needs to be aligned; for example, by converting both IRIs to URIs, eliminating escape differences in the resulting URIs, and making sure that the case of the hexadecimal characters in the percent-encoding is always the same (preferably upper case). If the IRI is to be passed to another application or used further in some other way, its original form MUST be preserved. The conversion described here should be performed only for local comparison. -5.2.4. Path Segment Equivalence +4.2.4. Path Segment Equivalence The complete path segments "." and ".." are intended only for use within relative references (Section 4.1 of [RFC3986]) and are removed as part of the reference resolution process (Section 5.2 of [RFC3986]). However, some implementations may incorrectly assume that reference resolution is not necessary when the reference is already an IRI, and thus fail to remove dot-segments when they occur - in non-relative paths. IRI normalizers should remove dot-segments by + in non-relative paths. IRI comparison SHOULD remove dot-segments by applying the remove_dot_segments algorithm to the path, as described in Section 5.2.4 of [RFC3986]. -5.3. Scheme-Based Comparison +4.3. Scheme-Based Comparison The syntax and semantics of IRIs vary from scheme to scheme, as described by the defining specification for each scheme. Implementations may use scheme-specific rules, at further processing cost, to reduce the probability of false negatives. 
For example, because the "http" scheme makes use of an authority component, has a default port of "80", and defines an empty path to be equivalent to "/", the following four IRIs are equivalent: http://example.com http://example.com/ http://example.com:/ http://example.com:80/ In general, an IRI that uses the generic syntax for authority with an - empty path should be normalized to a path of "/". Likewise, an + empty path should be equivalent to a path of "/". Likewise, an explicit ":port", for which the port is empty or the default for the scheme, is equivalent to one where the port and its ":" delimiter are - elided and thus should be removed by scheme-based normalization. For - example, the second IRI above is the normal form for the "http" - scheme. + elided. - Another case where normalization varies by scheme is in the handling - of an empty authority component or empty host subcomponent. For many + Another case where equivalence varies by scheme is in the handling of + an empty authority component or empty host subcomponent. For many scheme specifications, an empty authority or host is considered an error; for others, it is considered equivalent to "localhost" or the - end-user's host. When a scheme defines a default for authority and - an IRI reference to that default is desired, the reference should be - normalized to an empty authority for the sake of uniformity, brevity, - and internationalization. If, however, either the userinfo or port - subcomponents are non-empty, then the host should be given explicitly - even if it matches the default. + end-user's host. - Normalization should not remove delimiters when their associated - component is empty unless it is licensed to do so by the scheme - specification. For example, the IRI "http://example.com/?" cannot be - assumed to be equivalent to any of the examples above. Likewise, the - presence or absence of delimiters within a userinfo subcomponent is - usually significant to its interpretation. 
The fragment component is - not subject to any scheme-based normalization; thus, two IRIs that - differ only by the suffix "#" are considered different regardless of - the scheme. + The presence of a missing component vs. one with an empty string + component in an IRI SHOULD NOT be treated as equivalent unless + explicitly defined as such by the scheme definition. For example, + the IRI "http://example.com/?" cannot be assumed to be equivalent to + any of the examples above; an empty query component is NOT equivalent + to a missing one. Likewise, the presence or absence of delimiters + within a userinfo subcomponent is usually significant to its + interpretation. The fragment component is not subject to any scheme- + based equivalence; thus, two IRIs that differ only by the suffix "#" + are considered different regardless of the scheme. Some IRI schemes allow the usage of Internationalized Domain Names (IDN) [RFC5890] either in their ireg-name part or elswhere. When in use in IRIs, those names SHOULD conform to the definition of U-Label in [RFC5890]. An IRI containing an invalid IDN cannot successfully be resolved. For legibility purposes, they SHOULD NOT be converted into ASCII Compatible Encoding (ACE). - Scheme-based normalization may also consider IDN components and their + Scheme-based comparison may also consider IDN components and their conversions to punycode as equivalent. As an example, "http://résumé.example.org" may be considered equivalent to "http://xn--rsum-bpad.example.org". - Other scheme-specific normalizations are possible. + Other scheme-specific equivalence rules are possible. -5.4. Protocol-Based Comparison +4.4. Protocol-Based Comparison Substantial effort to reduce the incidence of false negatives is often cost-effective for web spiders. Consequently, they implement even more aggressive techniques in IRI comparison. 
For example, if they observe that an IRI such as + http://example.com/data redirects to an IRI differing only in the trailing slash - http://example.com/data/ they will likely regard the two as equivalent in the future. This kind of technique is only appropriate when equivalence is clearly indicated by both the result of accessing the resources and the common conventions of their scheme's dereference algorithm (in this case, use of redirection by HTTP origin servers to avoid problems with relative references). -6. Security Considerations +5. Security Considerations The primary security difficulty comes from applications choosing the wrong equivalence relationship, or two different parties disagreeing on equivalence. This is especially a problem when IRIs are used in security protocols. Besides the large character repertoire of Unicode, reasons for confusion include different forms of normalization and different normalization expectations, use of percent-encoding with various legacy encodings, and bidirectionality issues. See also [UTR36]. -7. Acknowledgements +6. Acknowledgements This document was originally derived from [RFC3986] and [RFC3987], based on text contributed by Tim Bray. -8. References +7. References -8.1. Normative References +7.1. Normative References [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, March 1997. [RFC3490] Faltstrom, P., Hoffman, P., and A. Costello, "Internationalizing Domain Names in Applications (IDNA)", RFC 3490, March 2003. [RFC3491] Hoffman, P. and M. Blanchet, "Nameprep: A Stringprep Profile for Internationalized Domain Names (IDN)", @@ -517,37 +515,37 @@ [RFC3629] Yergeau, F., "UTF-8, a transformation format of ISO 10646", STD 63, RFC 3629, November 2003. [RFC3986] Berners-Lee, T., Fielding, R., and L. Masinter, "Uniform Resource Identifier (URI): Generic Syntax", STD 66, RFC 3986, January 2005. [RFC3987bis] Duerst, M., Masinter, L., and M. 
Suignard, - "Internationalized Resource Identifiers (IRIs)", 2011, + "Internationalized Resource Identifiers (IRIs)", 2012, . [RFC5890] Klensin, J., "Internationalized Domain Names for Applications (IDNA): Definitions and Document Framework", RFC 5890, August 2010. [UNIV6] The Unicode Consortium, "The Unicode Standard, Version 6.0.0 (Mountain View, CA, The Unicode Consortium, 2011, ISBN 978-1-936213-01-6)", October 2010. [UTR15] Davis, M. and M. Duerst, "Unicode Normalization Forms", Unicode Standard Annex #15, March 2008, . -8.2. Informative References +7.2. Informative References [HTML4] Raggett, D., Le Hors, A., and I. Jacobs, "HTML 4.01 Specification", World Wide Web Consortium Recommendation, December 1999, . [RFC2045] Freed, N. and N. Borenstein, "Multipurpose Internet Mail Extensions (MIME) Part One: Format of Internet Message Bodies", RFC 2045, November 1996.