draft-ietf-iri-comparison-01.txt | draft-ietf-iri-comparison-02.txt | |||
---|---|---|---|---|
Internationalized Resource Identifiers L. Masinter | Internationalized Resource Identifiers L. Masinter | |||
(iri) Adobe | (iri) Adobe | |||
Internet-Draft M. Duerst | Internet-Draft M. Duerst | |||
Intended status: Standards Track Aoyama Gakuin University | Updates: 3986 (if approved) Aoyama Gakuin University | |||
Expires: September 3, 2012 March 2, 2012 | Intended status: Standards Track October 23, 2012 | |||
Expires: April 26, 2013 | ||||
Equivalence and Canonicalization of Internationalized Resource | Comparison, Equivalence and Canonicalization of Internationalized | |||
Identifiers (IRIs) | Resource Identifiers | |||
draft-ietf-iri-comparison-01 | draft-ietf-iri-comparison-02 | |||
Abstract | Abstract | |||
Internationalized Resource Identifiers (IRIs) are unicode strings | Internationalized Resource Identifiers (IRIs) are Unicode strings | |||
used to identify resources on the Internet. Applications that use | used to identify resources on the Internet. Applications that use | |||
IRIs often define a means of comparing two IRIs to determine when two | IRIs often define a means of comparing IRIs to determine when two | |||
IRIs are equivalent for the purpose of that application. Some | IRIs are equivalent for the purpose of that application. Some | |||
applications also define a method for 'canonicalizing' or | applications also define a method for canonicalizing an IRI -- | |||
'normalizing' an IRI -- translating one IRI into another which is | translating one IRI into another which is equivalent under the | |||
equivalent under the comparison method used. | comparison method used. | |||
This document gives guidelines and best practices for defining and | This document gives guidelines and best practices for defining and | |||
using IRI comparison, equivalence, normalization and canonicalization | using IRI comparison and canonicalization methods. | |||
methods. | ||||
Comparison methods are used to determine equivalence. As URIs are a | ||||
subset of IRIs, the guidelines apply to URI comparison as well. | ||||
Status of this Memo | Status of this Memo | |||
This Internet-Draft is submitted in full conformance with the | This Internet-Draft is submitted in full conformance with the | |||
provisions of BCP 78 and BCP 79. | provisions of BCP 78 and BCP 79. | |||
Internet-Drafts are working documents of the Internet Engineering | Internet-Drafts are working documents of the Internet Engineering | |||
Task Force (IETF). Note that other groups may also distribute | Task Force (IETF). Note that other groups may also distribute | |||
working documents as Internet-Drafts. The list of current Internet- | working documents as Internet-Drafts. The list of current Internet- | |||
Drafts is at http://datatracker.ietf.org/drafts/current/. | Drafts is at http://datatracker.ietf.org/drafts/current/. | |||
Internet-Drafts are draft documents valid for a maximum of six months | Internet-Drafts are draft documents valid for a maximum of six months | |||
and may be updated, replaced, or obsoleted by other documents at any | and may be updated, replaced, or obsoleted by other documents at any | |||
time. It is inappropriate to use Internet-Drafts as reference | time. It is inappropriate to use Internet-Drafts as reference | |||
material or to cite them other than as "work in progress." | material or to cite them other than as "work in progress." | |||
This Internet-Draft will expire on September 3, 2012. | This Internet-Draft will expire on April 26, 2013. | |||
Copyright Notice | Copyright Notice | |||
Copyright (c) 2012 IETF Trust and the persons identified as the | Copyright (c) 2012 IETF Trust and the persons identified as the | |||
document authors. All rights reserved. | document authors. All rights reserved. | |||
This document is subject to BCP 78 and the IETF Trust's Legal | This document is subject to BCP 78 and the IETF Trust's Legal | |||
Provisions Relating to IETF Documents | Provisions Relating to IETF Documents | |||
(http://trustee.ietf.org/license-info) in effect on the date of | (http://trustee.ietf.org/license-info) in effect on the date of | |||
publication of this document. Please review these documents | publication of this document. Please review these documents | |||
skipping to change at page 3, line 8 | skipping to change at page 3, line 8 | |||
Without obtaining an adequate license from the person(s) controlling | Without obtaining an adequate license from the person(s) controlling | |||
the copyright in such materials, this document may not be modified | the copyright in such materials, this document may not be modified | |||
outside the IETF Standards Process, and derivative works of it may | outside the IETF Standards Process, and derivative works of it may | |||
not be created outside the IETF Standards Process, except to format | not be created outside the IETF Standards Process, except to format | |||
it for publication as an RFC or to translate it into languages other | it for publication as an RFC or to translate it into languages other | |||
than English. | than English. | |||
Table of Contents | Table of Contents | |||
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 4 | 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 4 | |||
2. Equivalence . . . . . . . . . . . . . . . . . . . . . . . . . 4 | 2. General guidelines . . . . . . . . . . . . . . . . . . . . . . 4 | |||
3. Comparison, Equivalence, Normalization and | 3. Preparation for Comparison . . . . . . . . . . . . . . . . . . 5 | |||
Canonicalization . . . . . . . . . . . . . . . . . . . . . . . 5 | 4. Comparison Hierarchy . . . . . . . . . . . . . . . . . . . . . 6 | |||
4. Preparation for Comparison . . . . . . . . . . . . . . . . . . 5 | 4.1. Simple String Comparison . . . . . . . . . . . . . . . . . 6 | |||
5. Comparison Ladder . . . . . . . . . . . . . . . . . . . . . . 6 | 4.2. Syntax-Based Equivalence . . . . . . . . . . . . . . . . . 7 | |||
5.1. Simple String Comparison . . . . . . . . . . . . . . . . . 6 | 4.2.1. Case Equivalence . . . . . . . . . . . . . . . . . . . 8 | |||
5.2. Syntax-Based Equivalence . . . . . . . . . . . . . . . . . 7 | 4.2.2. Unicode Character Normalization . . . . . . . . . . . 8 | |||
5.2.1. Case Equivalence . . . . . . . . . . . . . . . . . . . 8 | 4.2.3. Percent-Encoding Equivalence . . . . . . . . . . . . . 9 | |||
5.2.2. Unicode Character Normalization . . . . . . . . . . . 8 | 4.2.4. Path Segment Equivalence . . . . . . . . . . . . . . . 10 | |||
5.2.3. Percent-Encoding Equivalence . . . . . . . . . . . . . 9 | 4.3. Scheme-Based Comparison . . . . . . . . . . . . . . . . . 10 | |||
5.2.4. Path Segment Equivalence . . . . . . . . . . . . . . . 10 | 4.4. Protocol-Based Comparison . . . . . . . . . . . . . . . . 11 | |||
5.3. Scheme-Based Comparison . . . . . . . . . . . . . . . . . 10 | 5. Security Considerations . . . . . . . . . . . . . . . . . . . 12 | |||
5.4. Protocol-Based Comparison . . . . . . . . . . . . . . . . 11 | 6. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 12 | |||
6. Security Considerations . . . . . . . . . . . . . . . . . . . 12 | 7. References . . . . . . . . . . . . . . . . . . . . . . . . . . 12 | |||
7. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 12 | 7.1. Normative References . . . . . . . . . . . . . . . . . . . 12 | |||
8. References . . . . . . . . . . . . . . . . . . . . . . . . . . 12 | 7.2. Informative References . . . . . . . . . . . . . . . . . . 13 | |||
8.1. Normative References . . . . . . . . . . . . . . . . . . . 12 | ||||
8.2. Informative References . . . . . . . . . . . . . . . . . . 13 | ||||
Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 14 | Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 14 | |||
1. Introduction | 1. Introduction | |||
Internationalized Resource Identifiers (IRIs) are unicode strings | Internationalized Resource Identifiers (IRIs) are Unicode strings | |||
used to identify resources on the Internet. Applications that use | used to identify resources on the Internet. Applications that use | |||
IRIs often define a means of comparing two IRIs to determine when two | IRIs often define a means of comparing IRIs to determine when two | |||
IRIs are equivalent for the purpose of that application. Some | IRIs are equivalent for the purpose of that application. Some | |||
applications also define a method for 'canonicalizing' or | applications also define a method for canonicalizing an IRI -- | |||
'normalizing' an IRI -- translating one IRI into another which is | translating one IRI into another which is equivalent under the | |||
equivalent under the comparison method used. | comparison method used. | |||
This document gives guidelines and best practices for defining and | This document gives guidelines and best practices for defining and | |||
using IRI comparison, equivalence, normalization and canonicalization | using IRI comparison and canonicalization methods. | |||
methods. | ||||
One of the most common operations on IRIs is simple comparison: | As every URI is also an IRI, the comparison and canonicalization | |||
Determining whether two IRIs are equivalent, without using the IRIs | methods also apply to URIs. | |||
to access their respective resource(s). A comparison is performed | ||||
whenever a response cache is accessed, a browser checks its history | IRI comparison is expected to determine whether two IRIs are | |||
to color a link, or an XML parser processes tags within a namespace. | equivalent without using the IRIs to access their respective | |||
Extensive normalization prior to comparison of IRIs may be used by | resource(s). For example, comparisons are performed whenever a | |||
spiders and indexing engines to prune a search space or reduce | response cache is accessed, a browser checks its history to color a | |||
duplication of request actions and response storage. | link, or an XML parser processes tags within a namespace. | |||
Comparison for equivalence is often accomplished by canonicalization | ||||
(sometimes called normalization): a process for converting data that | ||||
has more than one possible representation into a "standard", | ||||
"normal", or "canonical" form. Extensive canonicalization prior to | ||||
comparison of IRIs may be used by spiders and indexing engines to | ||||
prune a search space or reduce duplication of request actions and | ||||
response storage. | ||||
IRI comparison is performed for some particular purpose. Protocols | IRI comparison is performed for some particular purpose. Protocols | |||
or implementations that compare IRIs for different purposes will | or implementations that compare IRIs for different purposes will | |||
often be subject to differing design trade-offs with regard to how | often be subject to differing design trade-offs with regard to how | |||
much effort should be spent in reducing aliased identifiers. This | much effort should be spent in reducing aliased identifiers. This | |||
document describes various methods that may be used to compare IRIs, | document describes various methods that may be used to compare IRIs, | |||
the trade-offs between them, and the types of applications that might | the trade-offs between them, and the types of applications that might | |||
use them. | use them. | |||
2. Equivalence | 2. General guidelines | |||
Because IRIs exist to identify resources, presumably they should be | Because IRIs exist to identify resources, one might expect two IRIs | |||
considered equivalent when they identify the same resource. However, | to be considered equivalent when they identify the same resource. | |||
this definition of equivalence is not of much practical use, as there | However, this definition of equivalence is not of much practical use, | |||
is no way for an implementation to compare two resources to determine | as there is in general no way for an implementation to compare two | |||
if they are "the same" unless it has full knowledge or control of | resources to determine if they are "the same" unless it has full | |||
them. For this reason, determination of equivalence or difference of | knowledge or control of them. Comparison methods for IRIs are | |||
IRIs is based on string comparison, augmented by reference to | generally based strictly on examining the characters that make up the | |||
additional rules provided by scheme definition. We use the terms | IRI, without performing any network access. | |||
"different" and "equivalent" to describe the possible outcomes of | ||||
such comparisons, but there are many application-dependent versions | We use the terms "different" and "equivalent" to describe the | |||
of equivalence. | possible outcomes of such comparisons, but there are many | |||
application-dependent versions of equivalence. | ||||
Even when it is possible to determine that two IRIs are equivalent, | Even when it is possible to determine that two IRIs are equivalent, | |||
IRI comparison is not sufficient to determine whether two IRIs | IRI comparison is not sufficient to determine whether two IRIs | |||
identify different resources. For example, an owner of two different | identify different resources. For example, an owner of two different | |||
domain names could decide to serve the same resource from both, | domain names could decide to serve the same resource from both, | |||
resulting in two different IRIs. Therefore, comparison methods are | resulting in two different IRIs. For this reason, false negatives | |||
designed to minimize false negatives while strictly avoiding false | (e.g., returning "different" even when the resources are "the same") | |||
positives. | cannot be completely avoided. Comparison methods often try to | |||
minimize false negatives while strictly avoiding false positives. | ||||
In testing for equivalence, applications should not directly compare | However, in some cases (such as cache invalidation), false negatives | |||
relative references; the references should be converted to their | are more harmful than false positives. | |||
respective target IRIs before comparison. When IRIs are compared to | ||||
select (or avoid) a network action, such as retrieval of a | ||||
representation, fragment components (if any) MUST be excluded from | ||||
the comparison. | ||||
Applications using IRIs as identity tokens with no relationship to a | ||||
protocol MUST use the Simple String Comparison (see Section 5.1). | ||||
All other applications MUST select one of the comparison practices | ||||
from the Comparison Ladder (see Section 5). | ||||
3. Comparison, Equivalence, Normalization and Canonicalization | A comparison method for determining equivalence might have multiple | |||
values, for example, returning "equivalent", "different", or | ||||
"equivalence cannot be determined". | ||||
In general, when considering a set of items or strings, there are | Multiple canonicalization (normalization) methods might be defined, | |||
several interrelated concepts. A comparison method determines, | where sequential application of each results in greater sets of | |||
between two items in the set, their relationship. In particular, a | equivalent values. | |||
comparison method for determining equivalence might result in a | ||||
determination that two (different) items are equivalent, known to be | ||||
different, or that equivalence isn't determined. | ||||
One way to define a comparison for equivalence is to define a | In testing for equivalence, applications should not directly compare | |||
normalization or canonicalization algorithm. For each item in a set | relative references; the references should be converted to their | |||
of equivalent items, one of them could be designated the "normal" or | respective target IRIs before comparison. [[ref 3987bis]] | |||
"canonical" form. | ||||
These general concepts are used with IRIs in this document, and in | Some IRIs contain fragment identifiers. In general, the equivalence | |||
other circumstances, where a mapping from one sequence of Unicode | of two IRIs is determined first by comparing the IRIs without any | |||
characters to another one could be described as a "normalization" | fragment identifiers, and then (if appropriate) the fragment | |||
algorithm. | components (if any) are compared. | |||
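The two-step treatment of relative references and fragment identifiers can be sketched as follows. This is a minimal illustration that assumes Python's standard urllib.parse module and uses plain string equality for the comparison itself; the function names and the example.org references are hypothetical and do not come from either draft.

```python
from urllib.parse import urldefrag, urljoin


def split_for_comparison(reference, base):
    """Resolve a possibly relative reference against base, then split off
    the fragment. Returns (target_without_fragment, fragment)."""
    target = urljoin(base, reference)
    without_fragment, fragment = urldefrag(target)
    return without_fragment, fragment


def same_target(ref_a, ref_b, base, compare_fragments=False):
    """Compare two references by their resolved targets.

    The fragment-free parts are compared first; fragments are compared
    only when the caller asks for it (assumed here to be an application
    decision)."""
    a_main, a_frag = split_for_comparison(ref_a, base)
    b_main, b_frag = split_for_comparison(ref_b, base)
    if a_main != b_main:
        return False
    return a_frag == b_frag if compare_fragments else True


base = "http://example.org/docs/index.html"
print(same_target("guide#intro", "http://example.org/docs/guide#setup", base))
print(same_target("guide#intro", "http://example.org/docs/guide#setup", base,
                  compare_fragments=True))
```

The first call ignores fragments, as an application selecting a network action would; the second also compares them.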
In general, this document tries to stay with the "equivalence" or | Some applications (such as XML namespaces) use IRIs as identity | |||
"comparison" methods, because sometimes the mathematical notion of | tokens without any relationship to accessing the resources. Those | |||
"normalization" results in forms that ordinary users might not | applications use the Simple String Comparison (see Section 4.1). | |||
consider "normal" in an ordinary sense. | ||||
4. Preparation for Comparison | 3. Preparation for Comparison | |||
Any kind of IRI comparison REQUIRES that any additional contextual | Any kind of IRI comparison REQUIRES that any additional contextual | |||
processing is first performed, including undoing higher-level | processing is first performed, including undoing higher-level | |||
escapings or encodings in the protocol or format that carries an IRI. | escapings or encodings in the protocol or format that carries an IRI. | |||
This preprocessing is usually done when the protocol or format is | This preprocessing is usually done when the protocol or format is | |||
parsed. | parsed. | |||
NOTE: This document has not yet been updated to use in-line Unicode | ||||
examples. | ||||
Examples of such escapings or encodings are entities and numeric | Examples of such escapings or encodings are entities and numeric | |||
character references in [HTML4] and [XML1]. As an example, | character references in [HTML4] and [XML1]. As an example, | |||
"http://example.org/ros&eacute;" (in HTML), | "http://example.org/ros&eacute;" (in HTML), | |||
"http://example.org/ros&#233;" (in HTML or XML), and | "http://example.org/ros&#233;" (in HTML or XML), and | |||
"http://example.org/ros&#xE9;" (in HTML or XML) are all resolved into | "http://example.org/ros&#xE9;" (in HTML or XML) are all resolved into | |||
what is denoted in this document (see 'Notation' section of | what is denoted in this document (see 'Notation' section of | |||
[RFC3987bis]) as "http://example.org/ros&#xE9;" (the "&#xE9;" here | [RFC3987bis]) as "http://example.org/ros&#xE9;" (the "&#xE9;" here | |||
standing for the actual e-acute character, to compensate for the fact | standing for the actual e-acute character, to compensate for the fact | |||
that this document cannot contain non-ASCII characters). | that this document cannot contain non-ASCII characters). | |||
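As a rough illustration of undoing format-level escaping before comparison, the sketch below uses Python's html.unescape on three serialized forms (a named entity, a decimal character reference, and a hexadecimal character reference). The helper name and the assumption that the carrying format is HTML are illustrative only; other formats need their own unescaping step.

```python
import html


def undo_markup_escaping(value):
    """Resolve HTML entities and numeric character references so that the
    result is a plain sequence of Unicode characters."""
    return html.unescape(value)


serialized = [
    "http://example.org/ros&eacute;",  # named entity (HTML)
    "http://example.org/ros&#233;",    # decimal reference (HTML or XML)
    "http://example.org/ros&#xE9;",    # hexadecimal reference (HTML or XML)
]

decoded = [undo_markup_escaping(s) for s in serialized]
# All three forms resolve to the same IRI ending in U+00E9.
print(len(set(decoded)) == 1)
```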
Similar considerations apply to encodings such as Transfer Codings in | An IRI is a sequence of Unicode characters. IRIs are sometimes | |||
HTTP (see [RFC2616]) and Content Transfer Encodings in MIME | represented in documents as sequences of bytes in a charset, either | |||
([RFC2045]), although in these cases, the encoding is based not on | Unicode-based (UTF-8) or using some other character encoding (e.g., | |||
characters but on octets, and additional care is required to make | ISO-8859-1). Before comparing two such sequences, they must both be | |||
sure that characters, and not just arbitrary octets, are compared | converted into sequences of Unicode characters. | |||
(see Section 5.1). | ||||
5. Comparison Ladder | Similarly, encodings such as Transfer Codings in HTTP (see [RFC2616]) | |||
and Content Transfer Encodings in MIME ([RFC2045]) must be decoded. | ||||
In these cases, the encoding is based not on characters but on | ||||
octets, and additional care is required to make sure that characters, | ||||
and not just arbitrary octets, are compared (see Section 4.1). | ||||
4. Comparison Hierarchy | ||||
In practice, a variety of methods are used to test IRI equivalence. | In practice, a variety of methods are used to test IRI equivalence. | |||
These methods fall into a range distinguished by the amount of | These methods generally fall into a range distinguished by the amount | |||
processing required and the degree to which the probability of false | of processing required and the degree to which the probability of | |||
negatives is reduced. As noted above, false negatives cannot be | false negatives is reduced. As noted above, false negatives cannot | |||
eliminated. In practice, their probability can be reduced, but this | be eliminated. In practice, their probability can be reduced, but | |||
reduction requires more processing and is not cost-effective for all | this reduction requires more processing and is not cost-effective for | |||
applications. | all applications. | |||
If this range of comparison practices is considered as a ladder, the | The following discussion starts with comparison methods that are | |||
following discussion will climb the ladder, starting with practices | cheap but have a relatively higher chance of producing false | |||
that are cheap but have a relatively higher chance of producing false | ||||
negatives, and proceeding to those that have higher computational | negatives, and proceeding to those that have higher computational | |||
cost and lower risk of false negatives. | cost and lower risk of false negatives. | |||
5.1. Simple String Comparison | 4.1. Simple String Comparison | |||
If two IRIs, when considered as character strings, are identical, | If two IRIs (when considered as strings of Unicode characters) are | |||
then it is safe to conclude that they are equivalent. This type of | identical, then it is safe to conclude that they are equivalent. | |||
equivalence test has very low computational cost and is in wide use | This type of equivalence test has very low computational cost and is | |||
in a variety of applications, particularly in the domain of parsing. | in wide use in a variety of applications, particularly in the domain | |||
It is also used when a definitive answer to the question of IRI | of parsing. It is also used when a definitive answer to the question | |||
equivalence is needed that is independent of the scheme used and that | of IRI equivalence is needed that is independent of the scheme used | |||
can be calculated quickly and without accessing a network. An | and that can be calculated quickly and without accessing a network. | |||
example of such a case is XML Namespaces ([XMLNamespace]). | An example of such a case is XML Namespaces ([XMLNamespace]). | |||
Testing strings for equivalence requires some basic precautions. | Testing strings for equivalence requires some basic precautions. | |||
This procedure is often referred to as "bit-for-bit" or "byte-for- | This procedure is often referred to as "bit-for-bit" or "byte-for- | |||
byte" comparison, which is potentially misleading. Testing strings | byte" comparison, which is potentially misleading. Testing strings | |||
for equality is normally based on pair comparison of the characters | for equality is normally based on pair comparison of the characters | |||
that make up the strings, starting from the first and proceeding | that make up the strings, starting from the first and proceeding | |||
until both strings are exhausted and all characters are found to be | until both strings are exhausted and all characters are found to be | |||
equal, until a pair of characters compares unequal, or until one of | equal, until a pair of characters compares unequal, or until one of | |||
the strings is exhausted before the other. | the strings is exhausted before the other. | |||
skipping to change at page 7, line 31 | skipping to change at page 7, line 33 | |||
codepoint by codepoint after conversion to a common character | codepoint by codepoint after conversion to a common character | |||
encoding form. When comparing character by character, the comparison | encoding form. When comparing character by character, the comparison | |||
function MUST NOT map IRIs to URIs, because such a mapping would | function MUST NOT map IRIs to URIs, because such a mapping would | |||
create additional spurious equivalences. It follows that an IRI | create additional spurious equivalences. It follows that an IRI | |||
SHOULD NOT be modified when being transported if there is any chance | SHOULD NOT be modified when being transported if there is any chance | |||
that this IRI might be used in a context that uses Simple String | that this IRI might be used in a context that uses Simple String | |||
Comparison. | Comparison. | |||
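A small sketch of why "byte-for-byte" is a misleading description: the same IRI stored in two charsets has different octets but identical code points. The function name and the choice of charsets are assumptions made for this example. Note that the sketch deliberately does not map the IRI to a URI, so a percent-encoded form compares as different.

```python
def simple_string_equivalent(a, a_charset, b, b_charset):
    """Decode both octet sequences to Unicode strings and compare them.

    Python compares str values code point by code point, which matches
    the comparison described in the text once both inputs are decoded."""
    return a.decode(a_charset) == b.decode(b_charset)


iri = "http://example.org/ros\u00e9"
as_utf8 = iri.encode("utf-8")
as_latin1 = iri.encode("iso-8859-1")

print(as_utf8 == as_latin1)  # the raw octets differ
print(simple_string_equivalent(as_utf8, "utf-8", as_latin1, "iso-8859-1"))

# No IRI-to-URI mapping is applied, so the percent-encoded form is "different".
print(simple_string_equivalent(b"http://example.org/ros%C3%A9", "ascii",
                               as_utf8, "utf-8"))
```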
False negatives are caused by the production and use of IRI aliases. | False negatives are caused by the production and use of IRI aliases. | |||
Unnecessary aliases can be reduced, regardless of the comparison | Unnecessary aliases can be reduced, regardless of the comparison | |||
method, by consistently providing IRI references in an already | method, by consistently providing IRI references in a canonical form | |||
normalized form (i.e., a form identical to what would be produced | (after canonicalization is applied). | |||
after normalization is applied, as described below). Protocols and | ||||
data formats often limit some IRI comparisons to simple string | ||||
comparison, based on the theory that people and implementations will, | ||||
in their own best interest, be consistent in providing IRI | ||||
references, or at least be consistent enough to negate any efficiency | ||||
that might be obtained from further normalization. | ||||
5.2. Syntax-Based Equivalence | Protocols and data formats might limit some IRI comparisons to simple | |||
string comparison, based on the theory that people and | ||||
implementations will, in their own best interest, be consistent in | ||||
providing IRI references, or at least be consistent enough to negate | ||||
any efficiency that might be obtained from further canonicalization. | ||||
4.2. Syntax-Based Equivalence | ||||
Implementations may use logic based on the definitions provided by | Implementations may use logic based on the definitions provided by | |||
this specification to reduce the probability of false negatives. | this specification to reduce the probability of false negatives. | |||
This processing is moderately higher in cost than character-for- | This processing is moderately higher in cost than character-for- | |||
character string comparison. For example, an application using this | character string comparison. For example, an application using this | |||
approach could reasonably consider the following two IRIs equivalent: | approach could reasonably consider the following two IRIs equivalent: | |||
example://a/b/c/%7Bfoo%7D/ros&#xE9; | example://a/b/c/%7Bfoo%7D/ros&#xE9; | |||
eXAMPLE://a/./b/../b/%63/%7bfoo%7d/ros%C3%A9 | eXAMPLE://a/./b/../b/%63/%7bfoo%7d/ros%C3%A9 | |||
Web user agents, such as browsers, typically apply this type of IRI | Web user agents, such as browsers, typically apply this type of IRI | |||
normalization when determining whether a cached response is | equivalence when determining whether a cached response is available. | |||
available. Syntax-based normalization includes such techniques as | Syntax-based equivalence includes such techniques as case | |||
case normalization, character normalization, percent-encoding | equivalence, Unicode character normalization, percent-encoding | |||
normalization, and removal of dot-segments. | equivalence, and removal of dot-segments. | |||
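Of the techniques listed, removal of dot-segments is the only one not shown elsewhere in this part of the document, so a minimal sketch follows. It is a simplified, segment-based variant rather than the step-by-step algorithm of [RFC3986], and it assumes an absolute path (one that starts with "/"); relative paths are out of scope here. The function name is hypothetical.

```python
def remove_dot_segments(path):
    """Collapse "." and ".." segments in an absolute path.

    Assumes path starts with "/" (or is empty). A ".." at the root is
    ignored rather than climbing above the root."""
    segments = path.split("/")
    output = []
    for index, segment in enumerate(segments):
        is_last = index == len(segments) - 1
        if segment == ".":
            # A trailing "." leaves a trailing slash behind.
            if is_last:
                output.append("")
        elif segment == "..":
            # Keep the leading empty segment that represents the root.
            if len(output) > 1:
                output.pop()
            if is_last:
                output.append("")
        else:
            output.append(segment)
    return "/".join(output)


print(remove_dot_segments("/a/./b/../b/%63/%7bfoo%7d/ros%C3%A9"))
```

Applied to the path of the second example IRI, this yields "/a/b/%63/%7bfoo%7d/ros%C3%A9"; the remaining differences from the first IRI are matters of case and percent-encoding.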
5.2.1. Case Equivalence | 4.2.1. Case Equivalence | |||
For all IRIs, the hexadecimal digits within a percent-encoding | For all IRIs, the hexadecimal digits within a percent-encoding | |||
triplet (e.g., "%3a" versus "%3A") are case-insensitive and therefore | triplet (e.g., "%3a" versus "%3A") are case-insensitive and therefore | |||
should be considered equivalent to forms which use uppercase letters | should be considered equivalent to forms which use uppercase letters | |||
for the digits A-F. | for the digits A-F. | |||
When an IRI uses components of the generic syntax, the component | When an IRI uses components of the generic syntax, the component | |||
syntax equivalence rules always apply; namely, that the scheme and | syntax equivalence rules always apply; namely, that the scheme and | |||
US-ASCII only host are case insensitive and therefore should be | US-ASCII only host are case insensitive and therefore should be | |||
normalized to lowercase. For example, the URI | treated as equivalent to lowercase. For example, the URI | |||
"HTTP://www.EXAMPLE.com/" is equivalent to "http://www.example.com/". | "HTTP://www.EXAMPLE.com/" is equivalent to "http://www.example.com/". | |||
Case equivalence for non-ASCII characters in IRI components that are | Case equivalence for non-ASCII characters in IRI components that are | |||
IDNs is discussed in Section 5.3. The other generic syntax | IDNs is discussed in Section 4.3. The other generic syntax | |||
components are assumed to be case sensitive unless specifically | components are assumed to be case sensitive unless specifically | |||
defined otherwise by the scheme. | defined otherwise by the scheme. | |||
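The scheme and US-ASCII host rule can be sketched with Python's urllib.parse. The helper name is hypothetical, and the sketch leaves non-ASCII hosts untouched because their case handling belongs to IDN processing; path, query, and fragment are passed through unchanged since they are assumed to be case sensitive.

```python
from urllib.parse import urlsplit, urlunsplit


def lowercase_scheme_and_ascii_host(iri):
    """Lowercase the scheme and, if it is pure US-ASCII, the host.

    Userinfo, path, query, and fragment keep their original case."""
    parts = urlsplit(iri)
    userinfo, at, hostport = parts.netloc.rpartition("@")
    if hostport.isascii():
        # The port consists of digits, so lowercasing it is harmless.
        hostport = hostport.lower()
    netloc = userinfo + at + hostport
    return urlunsplit(
        (parts.scheme.lower(), netloc, parts.path, parts.query, parts.fragment)
    )


print(lowercase_scheme_and_ascii_host("HTTP://www.EXAMPLE.com/"))
print(lowercase_scheme_and_ascii_host("HTTP://User@www.EXAMPLE.com/Some/Path"))
```

One caveat of this approach: urlunsplit drops an empty query or fragment delimiter (a bare "?" or "#"), so it is not a byte-preserving round trip in every case.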
Creating schemes that allow case-insensitive syntax components | Creating schemes that allow case-insensitive syntax components | |||
containing non-ASCII characters should be avoided. Case | containing non-ASCII characters should be avoided. Case equivalence | |||
normalization of non-ASCII characters can be culturally dependent and | of non-ASCII characters can be culturally dependent and is always a | |||
is always a complex operation. The only exception concerns non-ASCII | complex operation. The only exception concerns non-ASCII host names | |||
host names for which the character normalization includes a mapping | for which the character normalization includes a mapping step derived | |||
step derived from case folding. | from case folding. | |||
5.2.2. Unicode Character Normalization | 4.2.2. Unicode Character Normalization | |||
The Unicode Standard [UNIV6] defines various equivalences between | The Unicode Standard [UNIV6] defines various equivalences between | |||
sequences of characters for various purposes. Unicode Standard Annex | sequences of characters for various purposes. Unicode Standard Annex | |||
#15 [UTR15] defines various Normalization Forms for these | #15 [UTR15] defines various Normalization Forms for these | |||
equivalences, in particular Normalization Form C (NFC, Canonical | equivalences, in particular Normalization Form C (NFC, Canonical | |||
Decomposition, followed by Canonical Composition) and Normalization | Decomposition, followed by Canonical Composition) and Normalization | |||
Form KC (NFKC, Compatibility Decomposition, followed by Canonical | Form KC (NFKC, Compatibility Decomposition, followed by Canonical | |||
Composition). | Composition). | |||
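To make the difference between the two forms concrete, the sketch below uses Python's unicodedata module on short strings chosen for this example. It only illustrates what NFC and NFKC do to character sequences; it is not a recommendation about when, or whether, an IRI should be normalized.

```python
import unicodedata

composed = "ros\u00e9"     # e-acute as one precomposed code point
decomposed = "rose\u0301"  # "e" followed by a combining acute accent


def nfc_equivalent(a, b):
    """Report whether two strings are identical after NFC normalization."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(composed == decomposed)                # different code point sequences
print(nfc_equivalent(composed, decomposed))  # canonically equivalent

# NFKC additionally folds compatibility characters such as the "fi" ligature.
ligature = "\ufb01le"
print(unicodedata.normalize("NFC", ligature) == ligature)
print(unicodedata.normalize("NFKC", ligature) == "file")
```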
IRIs already in Unicode MUST NOT be normalized before parsing or | IRIs already in Unicode MUST NOT be normalized before parsing or | |||
skipping to change at page 9, line 43 | skipping to change at page 9, line 44 | |||
unclear whether they are case sensitive, case insensitive, or | unclear whether they are case sensitive, case insensitive, or | |||
something in between (e.g., case sensitive, but with a multiple | something in between (e.g., case sensitive, but with a multiple | |||
choice selection if the wrong case is used, instead of a direct | choice selection if the wrong case is used, instead of a direct | |||
negative result). The best recipe is that the creator use a | negative result). The best recipe is that the creator use a | |||
reasonable capitalization and, when transferring the URI, | reasonable capitalization and, when transferring the URI, | |||
capitalization never be changed. | capitalization never be changed. | |||
Various IRI schemes may allow the usage of Internationalized Domain | Various IRI schemes may allow the usage of Internationalized Domain | |||
Names (IDN) [RFC5890] either in the ireg-name part or elsewhere. | Names (IDN) [RFC5890] either in the ireg-name part or elsewhere. | |||
Character Normalization also applies to IDNs, as discussed in | Character Normalization also applies to IDNs, as discussed in | |||
Section 5.3. | Section 4.3. | |||
5.2.3. Percent-Encoding Equivalence | 4.2.3. Percent-Encoding Equivalence | |||
The percent-encoding mechanism (Section 2.1 of [RFC3986]) is a | The percent-encoding mechanism (Section 2.1 of [RFC3986]) is a | |||
frequent source of variance among otherwise identical IRIs. In | frequent source of variance among otherwise identical IRIs. In | |||
addition to the case normalization issue noted above, some IRI | addition to the case equivalence issue noted above, some IRI | |||
producers percent-encode octets that do not require percent-encoding, | producers percent-encode octets that do not require percent-encoding, | |||
resulting in IRIs that are equivalent to their nonencoded | resulting in IRIs that are equivalent to their nonencoded | |||
counterparts. These IRIs should be normalized by decoding any | counterparts. These IRIs should be compared by first decoding any | |||
percent-encoded octet sequence that corresponds to an unreserved | percent-encoded octet sequence that corresponds to an unreserved | |||
character, as described in Section 2.3 of [RFC3986]. | character, as described in Section 2.3 of [RFC3986]. | |||
For actual resolution, differences in percent-encoding (except for | For actual resolution, differences in percent-encoding (except for | |||
the percent-encoding of reserved characters) MUST always result in | the percent-encoding of reserved characters) SHOULD always result in | |||
the same resource. For example, "http://example.org/~user", | the same resource. For example, "http://example.org/~user", | |||
"http://example.org/%7euser", and "http://example.org/%7Euser", must | "http://example.org/%7euser", and "http://example.org/%7Euser", | |||
resolve to the same resource. | SHOULD resolve to the same resource. | |||
If this kind of equivalence is to be tested, the percent-encoding of | If this kind of equivalence is to be tested, the percent-encoding of | |||
both IRIs to be compared has to be aligned; for example, by | both IRIs to be compared first needs to be aligned; for example, by | |||
converting both IRIs to URIs, eliminating escape differences in the | converting both IRIs to URIs, eliminating escape differences in the | |||
resulting URIs, and making sure that the case of the hexadecimal | resulting URIs, and making sure that the case of the hexadecimal | |||
characters in the percent-encoding is always the same (preferably | characters in the percent-encoding is always the same (preferably | |||
upper case). If the IRI is to be passed to another application or | upper case). If the IRI is to be passed to another application or | |||
used further in some other way, its original form MUST be preserved. | used further in some other way, its original form MUST be preserved. | |||
The conversion described here should be performed only for local | The conversion described here should be performed only for local | |||
comparison. | comparison. | |||
5.2.4. Path Segment Equivalence | 4.2.4. Path Segment Equivalence | |||
The complete path segments "." and ".." are intended only for use | The complete path segments "." and ".." are intended only for use | |||
within relative references (Section 4.1 of [RFC3986]) and are removed | within relative references (Section 4.1 of [RFC3986]) and are removed | |||
as part of the reference resolution process (Section 5.2 of | as part of the reference resolution process (Section 5.2 of | |||
[RFC3986]). However, some implementations may incorrectly assume | [RFC3986]). However, some implementations may incorrectly assume | |||
that reference resolution is not necessary when the reference is | that reference resolution is not necessary when the reference is | |||
already an IRI, and thus fail to remove dot-segments when they occur | already an IRI, and thus fail to remove dot-segments when they occur | |||
in non-relative paths. IRI normalizers should remove dot-segments by | in non-relative paths. IRI comparison SHOULD remove dot-segments by | |||
applying the remove_dot_segments algorithm to the path, as described | applying the remove_dot_segments algorithm to the path, as described | |||
in Section 5.2.4 of [RFC3986]. | in Section 5.2.4 of [RFC3986]. | |||
5.3. Scheme-Based Comparison | 4.3. Scheme-Based Comparison | |||
The syntax and semantics of IRIs vary from scheme to scheme, as | The syntax and semantics of IRIs vary from scheme to scheme, as | |||
described by the defining specification for each scheme. | described by the defining specification for each scheme. | |||
Implementations may use scheme-specific rules, at further processing | Implementations may use scheme-specific rules, at further processing | |||
cost, to reduce the probability of false negatives. For example, | cost, to reduce the probability of false negatives. For example, | |||
because the "http" scheme makes use of an authority component, has a | because the "http" scheme makes use of an authority component, has a | |||
default port of "80", and defines an empty path to be equivalent to | default port of "80", and defines an empty path to be equivalent to | |||
"/", the following four IRIs are equivalent: | "/", the following four IRIs are equivalent: | |||
http://example.com | http://example.com | |||
http://example.com/ | http://example.com/ | |||
http://example.com:/ | http://example.com:/ | |||
http://example.com:80/ | http://example.com:80/ | |||
In general, an IRI that uses the generic syntax for authority with an | In general, an IRI that uses the generic syntax for authority with an | |||
empty path should be normalized to a path of "/". Likewise, an | empty path should be equivalent to a path of "/". Likewise, an | |||
explicit ":port", for which the port is empty or the default for the | explicit ":port", for which the port is empty or the default for the | |||
scheme, is equivalent to one where the port and its ":" delimiter are | scheme, is equivalent to one where the port and its ":" delimiter are | |||
elided and thus should be removed by scheme-based normalization. For | elided. | |||
example, the second IRI above is the normal form for the "http" | ||||
scheme. | ||||
Another case where normalization varies by scheme is in the handling | Another case where equivalence varies by scheme is in the handling of | |||
of an empty authority component or empty host subcomponent. For many | an empty authority component or empty host subcomponent. For many | |||
scheme specifications, an empty authority or host is considered an | scheme specifications, an empty authority or host is considered an | |||
error; for others, it is considered equivalent to "localhost" or the | error; for others, it is considered equivalent to "localhost" or the | |||
end-user's host. When a scheme defines a default for authority and | end-user's host. | |||
an IRI reference to that default is desired, the reference should be | ||||
normalized to an empty authority for the sake of uniformity, brevity, | ||||
and internationalization. If, however, either the userinfo or port | ||||
subcomponents are non-empty, then the host should be given explicitly | ||||
even if it matches the default. | ||||
Normalization should not remove delimiters when their associated | The presence of a missing component vs. one with an empty string | |||
component is empty unless it is licensed to do so by the scheme | component in an IRI SHOULD NOT be treated as equivalent unless | |||
specification. For example, the IRI "http://example.com/?" cannot be | explicitly defined as such by the scheme definition. For example, | |||
assumed to be equivalent to any of the examples above. Likewise, the | the IRI "http://example.com/?" cannot be assumed to be equivalent to | |||
presence or absence of delimiters within a userinfo subcomponent is | any of the examples above; an empty query component is NOT equivalent | |||
usually significant to its interpretation. The fragment component is | to a missing one. Likewise, the presence or absence of delimiters | |||
not subject to any scheme-based normalization; thus, two IRIs that | within a userinfo subcomponent is usually significant to its | |||
differ only by the suffix "#" are considered different regardless of | interpretation. The fragment component is not subject to any scheme- | |||
the scheme. | based equivalence; thus, two IRIs that differ only by the suffix "#" | |||
are considered different regardless of the scheme. | ||||
Some IRI schemes allow the usage of Internationalized Domain Names | Some IRI schemes allow the usage of Internationalized Domain Names | |||
(IDN) [RFC5890] either in their ireg-name part or elsewhere. When in | (IDN) [RFC5890] either in their ireg-name part or elsewhere. When in | |||
use in IRIs, those names SHOULD conform to the definition of U-Label | use in IRIs, those names SHOULD conform to the definition of U-Label | |||
in [RFC5890]. An IRI containing an invalid IDN cannot successfully | in [RFC5890]. An IRI containing an invalid IDN cannot successfully | |||
be resolved. For legibility purposes, they SHOULD NOT be converted | be resolved. For legibility purposes, they SHOULD NOT be converted | |||
into ASCII Compatible Encoding (ACE). | into ASCII Compatible Encoding (ACE). | |||
Scheme-based normalization may also consider IDN components and their | Scheme-based comparison may also consider IDN components and their | |||
conversions to punycode as equivalent. As an example, | conversions to punycode as equivalent. As an example, | |||
"http://résumé.example.org" may be considered equivalent to | "http://résumé.example.org" may be considered equivalent to | |||
"http://xn--rsum-bpad.example.org". | "http://xn--rsum-bpad.example.org". | |||
Other scheme-specific normalizations are possible. | Other scheme-specific equivalence rules are possible. | |||
5.4. Protocol-Based Comparison | 4.4. Protocol-Based Comparison | |||
Substantial effort to reduce the incidence of false negatives is | Substantial effort to reduce the incidence of false negatives is | |||
often cost-effective for web spiders. Consequently, they implement | often cost-effective for web spiders. Consequently, they implement | |||
even more aggressive techniques in IRI comparison. For example, if | even more aggressive techniques in IRI comparison. For example, if | |||
they observe that an IRI such as | they observe that an IRI such as | |||
http://example.com/data | ||||
redirects to an IRI differing only in the trailing slash | http://example.com/data | |||
http://example.com/data/ | redirects to an IRI differing only in the trailing slash | |||
http://example.com/data/ | ||||
they will likely regard the two as equivalent in the future. This | they will likely regard the two as equivalent in the future. This | |||
kind of technique is only appropriate when equivalence is clearly | kind of technique is only appropriate when equivalence is clearly | |||
indicated by both the result of accessing the resources and the | indicated by both the result of accessing the resources and the | |||
common conventions of their scheme's dereference algorithm (in this | common conventions of their scheme's dereference algorithm (in this | |||
case, use of redirection by HTTP origin servers to avoid problems | case, use of redirection by HTTP origin servers to avoid problems | |||
with relative references). | with relative references). | |||
6. Security Considerations | 5. Security Considerations | |||
The primary security difficulty comes from applications choosing the | The primary security difficulty comes from applications choosing the | |||
wrong equivalence relationship, or two different parties disagreeing | wrong equivalence relationship, or two different parties disagreeing | |||
on equivalence. This is especially a problem when IRIs are used in | on equivalence. This is especially a problem when IRIs are used in | |||
security protocols. | security protocols. | |||
Besides the large character repertoire of Unicode, reasons for | Besides the large character repertoire of Unicode, reasons for | |||
confusion include different forms of normalization and different | confusion include different forms of normalization and different | |||
normalization expectations, use of percent-encoding with various | normalization expectations, use of percent-encoding with various | |||
legacy encodings, and bidirectionality issues. See also [UTR36]. | legacy encodings, and bidirectionality issues. See also [UTR36]. | |||
7. Acknowledgements | 6. Acknowledgements | |||
This document was originally derived from [RFC3986] and [RFC3987], | This document was originally derived from [RFC3986] and [RFC3987], | |||
based on text contributed by Tim Bray. | based on text contributed by Tim Bray. | |||
8. References | 7. References | |||
8.1. Normative References | 7.1. Normative References | |||
[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate | [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate | |||
Requirement Levels", BCP 14, RFC 2119, March 1997. | Requirement Levels", BCP 14, RFC 2119, March 1997. | |||
[RFC3490] Faltstrom, P., Hoffman, P., and A. Costello, | [RFC3490] Faltstrom, P., Hoffman, P., and A. Costello, | |||
"Internationalizing Domain Names in Applications (IDNA)", | "Internationalizing Domain Names in Applications (IDNA)", | |||
RFC 3490, March 2003. | RFC 3490, March 2003. | |||
[RFC3491] Hoffman, P. and M. Blanchet, "Nameprep: A Stringprep | [RFC3491] Hoffman, P. and M. Blanchet, "Nameprep: A Stringprep | |||
Profile for Internationalized Domain Names (IDN)", | Profile for Internationalized Domain Names (IDN)", | |||
skipping to change at page 13, line 14 | skipping to change at page 13, line 7 | |||
[RFC3629] Yergeau, F., "UTF-8, a transformation format of ISO | [RFC3629] Yergeau, F., "UTF-8, a transformation format of ISO | |||
10646", STD 63, RFC 3629, November 2003. | 10646", STD 63, RFC 3629, November 2003. | |||
[RFC3986] Berners-Lee, T., Fielding, R., and L. Masinter, "Uniform | [RFC3986] Berners-Lee, T., Fielding, R., and L. Masinter, "Uniform | |||
Resource Identifier (URI): Generic Syntax", STD 66, | Resource Identifier (URI): Generic Syntax", STD 66, | |||
RFC 3986, January 2005. | RFC 3986, January 2005. | |||
[RFC3987bis] | [RFC3987bis] | |||
Duerst, M., Masinter, L., and M. Suignard, | Duerst, M., Masinter, L., and M. Suignard, | |||
"Internationalized Resource Identifiers (IRIs)", 2011, | "Internationalized Resource Identifiers (IRIs)", 2012, | |||
<http://tools.ietf.org/id/draft-ietf-iri-3987bis>. | <http://tools.ietf.org/id/draft-ietf-iri-3987bis>. | |||
[RFC5890] Klensin, J., "Internationalized Domain Names for | [RFC5890] Klensin, J., "Internationalized Domain Names for | |||
Applications (IDNA): Definitions and Document Framework", | Applications (IDNA): Definitions and Document Framework", | |||
RFC 5890, August 2010. | RFC 5890, August 2010. | |||
[UNIV6] The Unicode Consortium, "The Unicode Standard, Version | [UNIV6] The Unicode Consortium, "The Unicode Standard, Version | |||
6.0.0 (Mountain View, CA, The Unicode Consortium, 2011, | 6.0.0 (Mountain View, CA, The Unicode Consortium, 2011, | |||
ISBN 978-1-936213-01-6)", October 2010. | ISBN 978-1-936213-01-6)", October 2010. | |||
[UTR15] Davis, M. and M. Duerst, "Unicode Normalization Forms", | [UTR15] Davis, M. and M. Duerst, "Unicode Normalization Forms", | |||
Unicode Standard Annex #15, March 2008, | Unicode Standard Annex #15, March 2008, | |||
<http://www.unicode.org/unicode/reports/tr15/ | <http://www.unicode.org/unicode/reports/tr15/ | |||
tr15-23.html>. | tr15-23.html>. | |||
8.2. Informative References | 7.2. Informative References | |||
[HTML4] Raggett, D., Le Hors, A., and I. Jacobs, "HTML 4.01 | [HTML4] Raggett, D., Le Hors, A., and I. Jacobs, "HTML 4.01 | |||
Specification", World Wide Web Consortium Recommendation, | Specification", World Wide Web Consortium Recommendation, | |||
December 1999, | December 1999, | |||
<http://www.w3.org/TR/html401/appendix/notes.html#h-B.2>. | <http://www.w3.org/TR/html401/appendix/notes.html#h-B.2>. | |||
[RFC2045] Freed, N. and N. Borenstein, "Multipurpose Internet Mail | [RFC2045] Freed, N. and N. Borenstein, "Multipurpose Internet Mail | |||
Extensions (MIME) Part One: Format of Internet Message | Extensions (MIME) Part One: Format of Internet Message | |||
Bodies", RFC 2045, November 1996. | Bodies", RFC 2045, November 1996. | |||
End of changes. 66 change blocks. | ||||
190 lines changed or deleted | 188 lines changed or added | |||
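
Editorial illustration (not part of either draft revision): the comparison steps described in Sections 4.2.3, 4.2.4 and 4.3 of the -02 text can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not a definitive implementation. The function names (to_uri, align_percent_encoding, remove_dot_segments, comparison_key) are hypothetical and appear in neither draft. The IRI-to-URI step simply percent-encodes non-ASCII characters as UTF-8 and ignores IDN handling, and the only scheme-specific rules applied are the "http" examples from Section 4.3 (empty or default port, empty path). The result is a local comparison key only; the original IRI is left untouched, as the text requires.

```python
import re

# Unreserved characters, RFC 3986 Section 2.3.
UNRESERVED = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~"
)
_PCT = re.compile(r"%([0-9A-Fa-f]{2})")
# Component split in the spirit of RFC 3986 Appendix B. The "?" and "#"
# delimiters stay in their groups so that an empty component ("?") can be
# told apart from a missing one (None).
_PARTS = re.compile(
    r"^(?:([^:/?#]+):)?(?://([^/?#]*))?([^?#]*)(\?[^#]*)?(#.*)?$", re.S
)
_HOSTPORT = re.compile(r"^(.*?)(?::(\d*))?$", re.S)


def to_uri(iri):
    """Assumed simplification: percent-encode non-ASCII characters as UTF-8."""
    return "".join(
        c if ord(c) < 128
        else "".join("%%%02X" % b for b in c.encode("utf-8"))
        for c in iri
    )


def align_percent_encoding(uri):
    """Decode escapes of unreserved characters; upper-case all other escapes."""
    def fix(match):
        ch = chr(int(match.group(1), 16))
        return ch if ch in UNRESERVED else "%" + match.group(1).upper()
    return _PCT.sub(fix, uri)


def remove_dot_segments(path):
    """remove_dot_segments as described in RFC 3986 Section 5.2.4."""
    out = []
    while path:
        if path.startswith("../"):
            path = path[3:]
        elif path.startswith("./") or path.startswith("/./"):
            path = path[2:]
        elif path == "/.":
            path = "/"
        elif path.startswith("/../") or path == "/..":
            path = "/" + path[4:]
            if out:
                out.pop()
        elif path in (".", ".."):
            path = ""
        else:
            # Move the first segment (with its leading "/") to the output.
            end = path.find("/", 1)
            end = len(path) if end == -1 else end
            out.append(path[:end])
            path = path[end:]
    return "".join(out)


def comparison_key(iri):
    """Build a key for local comparison only; never reuse it as the IRI."""
    uri = align_percent_encoding(to_uri(iri))
    scheme, authority, path, query, fragment = _PARTS.match(uri).groups()
    scheme = (scheme or "").lower()
    path = remove_dot_segments(path)
    if scheme == "http" and authority is not None:
        userinfo, at, hostport = authority.rpartition("@")
        host, port = _HOSTPORT.match(hostport).groups()
        if port not in (None, "", "80"):
            host = host + ":" + port  # keep only a non-default port
        authority = userinfo + at + host
        path = path or "/"  # empty path is equivalent to "/"
    # query and fragment are compared as-is, including None vs. empty.
    return (scheme, authority, path, query, fragment)
```

With this sketch, the four "http://example.com" forms listed in Section 4.3 yield the same key, while "http://example.com/?" and "http://example.com/#" each yield a different one, matching the rule that an empty component is not equivalent to a missing one.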