draft-ietf-iri-comparison-00.txt | draft-ietf-iri-comparison-01.txt | |||
---|---|---|---|---|
Internationalized Resource Identifiers L. Masinter | Internationalized Resource Identifiers L. Masinter | |||
(iri) Adobe | (iri) Adobe | |||
Internet-Draft M. Duerst | Internet-Draft M. Duerst | |||
Intended status: Standards Track Aoyama Gakuin University | Intended status: Standards Track Aoyama Gakuin University | |||
Expires: February 15, 2012 August 14, 2011 | Expires: September 3, 2012 March 2, 2012 | |||
Equivalence and Canonicalization of Internationalized Resource | Equivalence and Canonicalization of Internationalized Resource | |||
Identifiers (IRIs) | Identifiers (IRIs) | |||
draft-ietf-iri-comparison-00 | draft-ietf-iri-comparison-01 | |||
Abstract | Abstract | |||
Internationalized Resource Identifiers (IRIs) are Unicode strings | Internationalized Resource Identifiers (IRIs) are Unicode strings | |||
used to identify resources on the Internet. Applications that use | used to identify resources on the Internet. Applications that use | |||
IRIs often define a means of comparing two IRIs to determine when two | IRIs often define a means of comparing two IRIs to determine when two | |||
IRIs are equivalent for the purpose of that application. Some | IRIs are equivalent for the purpose of that application. Some | |||
applications also define a method for 'canonicalizing' or | applications also define a method for 'canonicalizing' or | |||
'normalizing' an IRI -- translating one IRI into another which is | 'normalizing' an IRI -- translating one IRI into another which is | |||
equivalent under the comparison method used. | equivalent under the comparison method used. | |||
skipping to change at page 1, line 42 | skipping to change at page 1, line 42 | |||
Internet-Drafts are working documents of the Internet Engineering | Internet-Drafts are working documents of the Internet Engineering | |||
Task Force (IETF). Note that other groups may also distribute | Task Force (IETF). Note that other groups may also distribute | |||
working documents as Internet-Drafts. The list of current Internet- | working documents as Internet-Drafts. The list of current Internet- | |||
Drafts is at http://datatracker.ietf.org/drafts/current/. | Drafts is at http://datatracker.ietf.org/drafts/current/. | |||
Internet-Drafts are draft documents valid for a maximum of six months | Internet-Drafts are draft documents valid for a maximum of six months | |||
and may be updated, replaced, or obsoleted by other documents at any | and may be updated, replaced, or obsoleted by other documents at any | |||
time. It is inappropriate to use Internet-Drafts as reference | time. It is inappropriate to use Internet-Drafts as reference | |||
material or to cite them other than as "work in progress." | material or to cite them other than as "work in progress." | |||
This Internet-Draft will expire on February 15, 2012. | This Internet-Draft will expire on September 3, 2012. | |||
Copyright Notice | Copyright Notice | |||
Copyright (c) 2011 IETF Trust and the persons identified as the | Copyright (c) 2012 IETF Trust and the persons identified as the | |||
document authors. All rights reserved. | document authors. All rights reserved. | |||
This document is subject to BCP 78 and the IETF Trust's Legal | This document is subject to BCP 78 and the IETF Trust's Legal | |||
Provisions Relating to IETF Documents | Provisions Relating to IETF Documents | |||
(http://trustee.ietf.org/license-info) in effect on the date of | (http://trustee.ietf.org/license-info) in effect on the date of | |||
publication of this document. Please review these documents | publication of this document. Please review these documents | |||
carefully, as they describe your rights and restrictions with respect | carefully, as they describe your rights and restrictions with respect | |||
to this document. Code Components extracted from this document must | to this document. Code Components extracted from this document must | |||
include Simplified BSD License text as described in Section 4.e of | include Simplified BSD License text as described in Section 4.e of | |||
the Trust Legal Provisions and are provided without warranty as | the Trust Legal Provisions and are provided without warranty as | |||
skipping to change at page 3, line 8 | skipping to change at page 3, line 8 | |||
Without obtaining an adequate license from the person(s) controlling | Without obtaining an adequate license from the person(s) controlling | |||
the copyright in such materials, this document may not be modified | the copyright in such materials, this document may not be modified | |||
outside the IETF Standards Process, and derivative works of it may | outside the IETF Standards Process, and derivative works of it may | |||
not be created outside the IETF Standards Process, except to format | not be created outside the IETF Standards Process, except to format | |||
it for publication as an RFC or to translate it into languages other | it for publication as an RFC or to translate it into languages other | |||
than English. | than English. | |||
Table of Contents | Table of Contents | |||
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 4 | 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 4 | |||
2. Equivalence . . . . . . . . . . . . . . . . . . . . . . . . . 5 | 2. Equivalence . . . . . . . . . . . . . . . . . . . . . . . . . 4 | |||
3. Preparation for Comparison . . . . . . . . . . . . . . . . . . 6 | 3. Comparison, Equivalence, Normalization and | |||
4. Comparison Ladder . . . . . . . . . . . . . . . . . . . . . . 6 | Canonicalization . . . . . . . . . . . . . . . . . . . . . . . 5 | |||
4.1. Simple String Comparison . . . . . . . . . . . . . . . . . 7 | 4. Preparation for Comparison . . . . . . . . . . . . . . . . . . 5 | |||
4.2. Syntax-Based Normalization . . . . . . . . . . . . . . . . 8 | 5. Comparison Ladder . . . . . . . . . . . . . . . . . . . . . . 6 | |||
4.2.1. Case Normalization . . . . . . . . . . . . . . . . . . 8 | 5.1. Simple String Comparison . . . . . . . . . . . . . . . . . 6 | |||
4.2.2. Character Normalization . . . . . . . . . . . . . . . 8 | 5.2. Syntax-Based Equivalence . . . . . . . . . . . . . . . . . 7 | |||
4.2.3. Percent-Encoding Normalization . . . . . . . . . . . . 10 | 5.2.1. Case Equivalence . . . . . . . . . . . . . . . . . . . 8 | |||
4.2.4. Path Segment Normalization . . . . . . . . . . . . . . 10 | 5.2.2. Unicode Character Normalization . . . . . . . . . . . 8 | |||
4.3. Scheme-Based Normalization . . . . . . . . . . . . . . . . 10 | 5.2.3. Percent-Encoding Equivalence . . . . . . . . . . . . . 9 | |||
4.4. Protocol-Based Normalization . . . . . . . . . . . . . . . 12 | 5.2.4. Path Segment Equivalence . . . . . . . . . . . . . . . 10 | |||
5. Security Considerations . . . . . . . . . . . . . . . . . . . 12 | 5.3. Scheme-Based Comparison . . . . . . . . . . . . . . . . . 10 | |||
6. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 12 | 5.4. Protocol-Based Comparison . . . . . . . . . . . . . . . . 11 | |||
7. References . . . . . . . . . . . . . . . . . . . . . . . . . . 13 | 6. Security Considerations . . . . . . . . . . . . . . . . . . . 12 | |||
7.1. Normative References . . . . . . . . . . . . . . . . . . . 13 | 7. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 12 | |||
7.2. Informative References . . . . . . . . . . . . . . . . . . 13 | 8. References . . . . . . . . . . . . . . . . . . . . . . . . . . 12 | |||
8.1. Normative References . . . . . . . . . . . . . . . . . . . 12 | ||||
8.2. Informative References . . . . . . . . . . . . . . . . . . 13 | ||||
Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 14 | Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 14 | |||
1. Introduction | 1. Introduction | |||
Internationalized Resource Identifiers (IRIs) are Unicode strings | Internationalized Resource Identifiers (IRIs) are Unicode strings | |||
used to identify resources on the Internet. Applications that use | used to identify resources on the Internet. Applications that use | |||
IRIs often define a means of comparing two IRIs to determine when two | IRIs often define a means of comparing two IRIs to determine when two | |||
IRIs are equivalent for the purpose of that application. Some | IRIs are equivalent for the purpose of that application. Some | |||
applications also define a method for 'canonicalizing' or | applications also define a method for 'canonicalizing' or | |||
'normalizing' an IRI -- translating one IRI into another which is | 'normalizing' an IRI -- translating one IRI into another which is | |||
equivalent under the comparison method used. | equivalent under the comparison method used. | |||
This document gives guidelines and best practices for defining and | This document gives guidelines and best practices for defining and | |||
using IRI comparison, equivalence, normalization and canonicalization | using IRI comparison, equivalence, normalization and canonicalization | |||
methods. | methods. | |||
Things to do: | ||||
o Introductory section on comparison, equivalence, normalization and | ||||
canonicalization. | ||||
o Verify acknowledgements for this component. | ||||
o Verify cross-references from other documents. | ||||
o Consider making 4395bis reference this document and recommend | ||||
scheme definitions describe equivalence specifically. | ||||
o Consider making this document 'update' 3986 in order to resolve | ||||
which one is normative if there are conflicts. | ||||
o alternatively? Consider making this document BCP rather than | ||||
standards track, since it basically gives guidance for protocols | ||||
and applications needing equivalence, and doesn't directly have a | ||||
scope of application? | ||||
o Distinguish between IRIs as sequence-of-unicode characters and | ||||
presentations of IRIs. | ||||
o Should we insist that percent-hex encoding equivalence of non- | ||||
reserved characters MUST be always used if there is any | ||||
equivalence at all? | ||||
o Update security considerations to describe security concerns | ||||
specific to comparison. | ||||
o Consider making sections talk about 'equivalent' rather than | ||||
'normalization' where appropriate. | ||||
One of the most common operations on IRIs is simple comparison: | One of the most common operations on IRIs is simple comparison: | |||
Determining whether two IRIs are equivalent, without using the IRIs | Determining whether two IRIs are equivalent, without using the IRIs | |||
to access their respective resource(s). A comparison is performed | to access their respective resource(s). A comparison is performed | |||
whenever a response cache is accessed, a browser checks its history | whenever a response cache is accessed, a browser checks its history | |||
to color a link, or an XML parser processes tags within a namespace. | to color a link, or an XML parser processes tags within a namespace. | |||
Extensive normalization prior to comparison of IRIs may be used by | Extensive normalization prior to comparison of IRIs may be used by | |||
spiders and indexing engines to prune a search space or reduce | spiders and indexing engines to prune a search space or reduce | |||
duplication of request actions and response storage. | duplication of request actions and response storage. | |||
IRI comparison is performed for some particular purpose. Protocols | IRI comparison is performed for some particular purpose. Protocols | |||
or implementations that compare IRIs for different purposes will | or implementations that compare IRIs for different purposes will | |||
skipping to change at page 5, line 29 | skipping to change at page 4, line 44 | |||
use them. | use them. | |||
2. Equivalence | 2. Equivalence | |||
Because IRIs exist to identify resources, presumably they should be | Because IRIs exist to identify resources, presumably they should be | |||
considered equivalent when they identify the same resource. However, | considered equivalent when they identify the same resource. However, | |||
this definition of equivalence is not of much practical use, as there | this definition of equivalence is not of much practical use, as there | |||
is no way for an implementation to compare two resources to determine | is no way for an implementation to compare two resources to determine | |||
if they are "the same" unless it has full knowledge or control of | if they are "the same" unless it has full knowledge or control of | |||
them. For this reason, determination of equivalence or difference of | them. For this reason, determination of equivalence or difference of | |||
IRIs is based on string comparison, perhaps augmented by reference to | IRIs is based on string comparison, augmented by reference to | |||
additional rules provided by URI scheme definitions. We use the | additional rules provided by scheme definition. We use the terms | |||
terms "different" and "equivalent" to describe the possible outcomes | "different" and "equivalent" to describe the possible outcomes of | |||
of such comparisons, but there are many application-dependent | such comparisons, but there are many application-dependent versions | |||
versions of equivalence. | of equivalence. | |||
Even when it is possible to determine that two IRIs are equivalent, | Even when it is possible to determine that two IRIs are equivalent, | |||
IRI comparison is not sufficient to determine whether two IRIs | IRI comparison is not sufficient to determine whether two IRIs | |||
identify different resources. For example, an owner of two different | identify different resources. For example, an owner of two different | |||
domain names could decide to serve the same resource from both, | domain names could decide to serve the same resource from both, | |||
resulting in two different IRIs. Therefore, comparison methods are | resulting in two different IRIs. Therefore, comparison methods are | |||
designed to minimize false negatives while strictly avoiding false | designed to minimize false negatives while strictly avoiding false | |||
positives. | positives. | |||
In testing for equivalence, applications should not directly compare | In testing for equivalence, applications should not directly compare | |||
relative references; the references should be converted to their | relative references; the references should be converted to their | |||
respective target IRIs before comparison. When IRIs are compared to | respective target IRIs before comparison. When IRIs are compared to | |||
select (or avoid) a network action, such as retrieval of a | select (or avoid) a network action, such as retrieval of a | |||
representation, fragment components (if any) MUST be excluded from | representation, fragment components (if any) MUST be excluded from | |||
the comparison. | the comparison. | |||
Applications using IRIs as identity tokens with no relationship to a | Applications using IRIs as identity tokens with no relationship to a | |||
protocol MUST use the Simple String Comparison (see Section 4.1). | protocol MUST use the Simple String Comparison (see Section 5.1). | |||
All other applications MUST select one of the comparison practices | All other applications MUST select one of the comparison practices | |||
from the Comparison Ladder (see Section 4). | from the Comparison Ladder (see Section 5). | |||
3. Preparation for Comparison | 3. Comparison, Equivalence, Normalization and Canonicalization | |||
In general, when considering a set of items or strings, there are | ||||
several interrelated concepts. A comparison method determines, | ||||
between two items in the set, their relationship. In particular, a | ||||
comparison method for determining equivalence might result in a | ||||
determination that two (different) items are equivalent, known to be | ||||
different, or that equivalence isn't determined. | ||||
One way to define a comparison for equivalence is to define a | ||||
normalization or canonicalization algorithm. For each item in a set | ||||
of equivalent items, one of them could be designated the "normal" or | ||||
"canonical" form. | ||||
These general concepts are used with IRIs in this document, and in | ||||
other circumstances, where a mapping from one sequence of Unicode | ||||
characters to another one could be described as a "normalization" | ||||
algorithm. | ||||
In general, this document tries to stay with the "equivalence" or | ||||
"comparison" methods, because sometimes the mathematical notion of | ||||
"normalization" results in forms that ordinary users might not | ||||
consider "normal" in an ordinary sense. | ||||
4. Preparation for Comparison | ||||
Any kind of IRI comparison REQUIRES that any additional contextual | Any kind of IRI comparison REQUIRES that any additional contextual | |||
processing is first performed, including undoing higher-level | processing is first performed, including undoing higher-level | |||
escapings or encodings in the protocol or format that carries an IRI. | escapings or encodings in the protocol or format that carries an IRI. | |||
This preprocessing is usually done when the protocol or format is | This preprocessing is usually done when the protocol or format is | |||
parsed. | parsed. | |||
Examples of such escapings or encodings are entities and numeric | Examples of such escapings or encodings are entities and numeric | |||
character references in [HTML4] and [XML1]. As an example, | character references in [HTML4] and [XML1]. As an example, | |||
"http://example.org/rosé" (in HTML), | "http://example.org/rosé" (in HTML), | |||
skipping to change at page 6, line 31 | skipping to change at page 6, line 23 | |||
what is denoted in this document (see 'Notation' section of | what is denoted in this document (see 'Notation' section of | |||
[RFC3987bis]) as "http://example.org/rosé" (the "é" here | [RFC3987bis]) as "http://example.org/rosé" (the "é" here | |||
standing for the actual e-acute character, to compensate for the fact | standing for the actual e-acute character, to compensate for the fact | |||
that this document cannot contain non-ASCII characters). | that this document cannot contain non-ASCII characters). | |||
Similar considerations apply to encodings such as Transfer Codings in | Similar considerations apply to encodings such as Transfer Codings in | |||
HTTP (see [RFC2616]) and Content Transfer Encodings in MIME | HTTP (see [RFC2616]) and Content Transfer Encodings in MIME | |||
([RFC2045]), although in these cases, the encoding is based not on | ([RFC2045]), although in these cases, the encoding is based not on | |||
characters but on octets, and additional care is required to make | characters but on octets, and additional care is required to make | |||
sure that characters, and not just arbitrary octets, are compared | sure that characters, and not just arbitrary octets, are compared | |||
(see Section 4.1). | (see Section 5.1). | |||
4. Comparison Ladder | 5. Comparison Ladder | |||
In practice, a variety of methods are used to test IRI equivalence. | In practice, a variety of methods are used to test IRI equivalence. | |||
These methods fall into a range distinguished by the amount of | These methods fall into a range distinguished by the amount of | |||
processing required and the degree to which the probability of false | processing required and the degree to which the probability of false | |||
negatives is reduced. As noted above, false negatives cannot be | negatives is reduced. As noted above, false negatives cannot be | |||
eliminated. In practice, their probability can be reduced, but this | eliminated. In practice, their probability can be reduced, but this | |||
reduction requires more processing and is not cost-effective for all | reduction requires more processing and is not cost-effective for all | |||
applications. | applications. | |||
If this range of comparison practices is considered as a ladder, the | If this range of comparison practices is considered as a ladder, the | |||
following discussion will climb the ladder, starting with practices | following discussion will climb the ladder, starting with practices | |||
that are cheap but have a relatively higher chance of producing false | that are cheap but have a relatively higher chance of producing false | |||
negatives, and proceeding to those that have higher computational | negatives, and proceeding to those that have higher computational | |||
cost and lower risk of false negatives. | cost and lower risk of false negatives. | |||
4.1. Simple String Comparison | 5.1. Simple String Comparison | |||
If two IRIs, when considered as character strings, are identical, | If two IRIs, when considered as character strings, are identical, | |||
then it is safe to conclude that they are equivalent. This type of | then it is safe to conclude that they are equivalent. This type of | |||
equivalence test has very low computational cost and is in wide use | equivalence test has very low computational cost and is in wide use | |||
in a variety of applications, particularly in the domain of parsing. | in a variety of applications, particularly in the domain of parsing. | |||
It is also used when a definitive answer to the question of IRI | It is also used when a definitive answer to the question of IRI | |||
equivalence is needed that is independent of the scheme used and that | equivalence is needed that is independent of the scheme used and that | |||
can be calculated quickly and without accessing a network. An | can be calculated quickly and without accessing a network. An | |||
example of such a case is XML Namespaces ([XMLNamespace]). | example of such a case is XML Namespaces ([XMLNamespace]). | |||
skipping to change at page 8, line 5 | skipping to change at page 7, line 40 | |||
Unnecessary aliases can be reduced, regardless of the comparison | Unnecessary aliases can be reduced, regardless of the comparison | |||
method, by consistently providing IRI references in an already | method, by consistently providing IRI references in an already | |||
normalized form (i.e., a form identical to what would be produced | normalized form (i.e., a form identical to what would be produced | |||
after normalization is applied, as described below). Protocols and | after normalization is applied, as described below). Protocols and | |||
data formats often limit some IRI comparisons to simple string | data formats often limit some IRI comparisons to simple string | |||
comparison, based on the theory that people and implementations will, | comparison, based on the theory that people and implementations will, | |||
in their own best interest, be consistent in providing IRI | in their own best interest, be consistent in providing IRI | |||
references, or at least be consistent enough to negate any efficiency | references, or at least be consistent enough to negate any efficiency | |||
that might be obtained from further normalization. | that might be obtained from further normalization. | |||
4.2. Syntax-Based Normalization | 5.2. Syntax-Based Equivalence | |||
Implementations may use logic based on the definitions provided by | Implementations may use logic based on the definitions provided by | |||
this specification to reduce the probability of false negatives. | this specification to reduce the probability of false negatives. | |||
This processing is moderately higher in cost than character-for- | This processing is moderately higher in cost than character-for- | |||
character string comparison. For example, an application using this | character string comparison. For example, an application using this | |||
approach could reasonably consider the following two IRIs equivalent: | approach could reasonably consider the following two IRIs equivalent: | |||
example://a/b/c/%7Bfoo%7D/rosé | example://a/b/c/%7Bfoo%7D/rosé | |||
eXAMPLE://a/./b/../b/%63/%7bfoo%7d/ros%C3%A9 | eXAMPLE://a/./b/../b/%63/%7bfoo%7d/ros%C3%A9 | |||
Web user agents, such as browsers, typically apply this type of IRI | Web user agents, such as browsers, typically apply this type of IRI | |||
normalization when determining whether a cached response is | normalization when determining whether a cached response is | |||
available. Syntax-based normalization includes such techniques as | available. Syntax-based normalization includes such techniques as | |||
case normalization, character normalization, percent-encoding | case normalization, character normalization, percent-encoding | |||
normalization, and removal of dot-segments. | normalization, and removal of dot-segments. | |||
4.2.1. Case Normalization | 5.2.1. Case Equivalence | |||
For all IRIs, the hexadecimal digits within a percent-encoding | For all IRIs, the hexadecimal digits within a percent-encoding | |||
triplet (e.g., "%3a" versus "%3A") are case-insensitive and therefore | triplet (e.g., "%3a" versus "%3A") are case-insensitive and therefore | |||
should be normalized to use uppercase letters for the digits A-F. | should be considered equivalent to forms which use uppercase letters | |||
for the digits A-F. | ||||
When an IRI uses components of the generic syntax, the component | When an IRI uses components of the generic syntax, the component | |||
syntax equivalence rules always apply; namely, that the scheme and | syntax equivalence rules always apply; namely, that the scheme and | |||
US-ASCII only host are case insensitive and therefore should be | US-ASCII only host are case insensitive and therefore should be | |||
normalized to lowercase. For example, the URI | normalized to lowercase. For example, the URI | |||
"HTTP://www.EXAMPLE.com/" is equivalent to "http://www.example.com/". | "HTTP://www.EXAMPLE.com/" is equivalent to "http://www.example.com/". | |||
Case equivalence for non-ASCII characters in IRI components that are | Case equivalence for non-ASCII characters in IRI components that are | |||
IDNs is discussed in Section 4.3. The other generic syntax | IDNs is discussed in Section 5.3. The other generic syntax | |||
components are assumed to be case sensitive unless specifically | components are assumed to be case sensitive unless specifically | |||
defined otherwise by the scheme. | defined otherwise by the scheme. | |||
Creating schemes that allow case-insensitive syntax components | Creating schemes that allow case-insensitive syntax components | |||
containing non-ASCII characters should be avoided. Case | containing non-ASCII characters should be avoided. Case | |||
normalization of non-ASCII characters can be culturally dependent and | normalization of non-ASCII characters can be culturally dependent and | |||
is always a complex operation. The only exception concerns non-ASCII | is always a complex operation. The only exception concerns non-ASCII | |||
host names for which the character normalization includes a mapping | host names for which the character normalization includes a mapping | |||
step derived from case folding. | step derived from case folding. | |||
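
The case rules in this section can be sketched in Python as follows. This is an illustrative sketch only, not part of either draft: the function name case_fold_for_comparison is hypothetical, the component split uses the generic regular expression from Appendix B of [RFC3986], and hosts containing non-ASCII characters are deliberately left untouched because their handling is deferred to scheme-based comparison.

```python
import re

# Generic component split, adapted from RFC 3986, Appendix B.
_GENERIC = re.compile(
    r"^(?:([^:/?#]+):)?(?://([^/?#]*))?([^?#]*)(?:\?([^#]*))?(?:#(.*))?$",
    re.S,
)
_TRIPLET = re.compile(r"%[0-9A-Fa-f]{2}")


def case_fold_for_comparison(iri: str) -> str:
    """Return a comparison key (hypothetical helper, local use only).

    - scheme is lowercased
    - a US-ASCII-only host is lowercased
    - hex digits in percent-encoding triplets are uppercased
    All other components keep their case.
    """
    scheme, authority, path, query, fragment = _GENERIC.match(iri).groups()
    out = ""
    if scheme is not None:
        out += scheme.lower() + ":"
    if authority is not None:
        # userinfo (if any) is case sensitive; only host[:port] is folded.
        userinfo, at, hostport = authority.rpartition("@")
        if hostport.isascii():
            hostport = hostport.lower()
        out += "//" + userinfo + at + hostport
    out += path
    if query is not None:
        out += "?" + query
    if fragment is not None:
        out += "#" + fragment
    # Done last so triplets inside the host end up uppercase as well.
    return _TRIPLET.sub(lambda m: m.group(0).upper(), out)
```

The result is meant to be used only as a local comparison key; the IRI in its original form is what should be stored or passed on.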
4.2.2. Character Normalization | 5.2.2. Unicode Character Normalization | |||
The Unicode Standard [UNIV6] defines various equivalences between | The Unicode Standard [UNIV6] defines various equivalences between | |||
sequences of characters for various purposes. Unicode Standard Annex | sequences of characters for various purposes. Unicode Standard Annex | |||
#15 [UTR15] defines various Normalization Forms for these | #15 [UTR15] defines various Normalization Forms for these | |||
equivalences, in particular Normalization Form C (NFC, Canonical | equivalences, in particular Normalization Form C (NFC, Canonical | |||
Decomposition, followed by Canonical Composition) and Normalization | Decomposition, followed by Canonical Composition) and Normalization | |||
Form KC (NFKC, Compatibility Decomposition, followed by Canonical | Form KC (NFKC, Compatibility Decomposition, followed by Canonical | |||
Composition). | Composition). | |||
IRIs already in Unicode MUST NOT be normalized before parsing or | IRIs already in Unicode MUST NOT be normalized before parsing or | |||
skipping to change at page 10, line 6 | skipping to change at page 9, line 43 | |||
unclear whether they are case sensitive, case insensitive, or | unclear whether they are case sensitive, case insensitive, or | |||
something in between (e.g., case sensitive, but with a multiple | something in between (e.g., case sensitive, but with a multiple | |||
choice selection if the wrong case is used, instead of a direct | choice selection if the wrong case is used, instead of a direct | |||
negative result). The best recipe is that the creator use a | negative result). The best recipe is that the creator use a | |||
reasonable capitalization and, when transferring the URI, | reasonable capitalization and, when transferring the URI, | |||
capitalization never be changed. | capitalization never be changed. | |||
Various IRI schemes may allow the usage of Internationalized Domain | Various IRI schemes may allow the usage of Internationalized Domain | |||
Names (IDN) [RFC5890] either in the ireg-name part or elsewhere. | Names (IDN) [RFC5890] either in the ireg-name part or elsewhere. | |||
Character Normalization also applies to IDNs, as discussed in | Character Normalization also applies to IDNs, as discussed in | |||
Section 4.3. | Section 5.3. | |||
4.2.3. Percent-Encoding Normalization | 5.2.3. Percent-Encoding Equivalence | |||
The percent-encoding mechanism (Section 2.1 of [RFC3986]) is a | The percent-encoding mechanism (Section 2.1 of [RFC3986]) is a | |||
frequent source of variance among otherwise identical IRIs. In | frequent source of variance among otherwise identical IRIs. In | |||
addition to the case normalization issue noted above, some IRI | addition to the case normalization issue noted above, some IRI | |||
producers percent-encode octets that do not require percent-encoding, | producers percent-encode octets that do not require percent-encoding, | |||
resulting in IRIs that are equivalent to their nonencoded | resulting in IRIs that are equivalent to their nonencoded | |||
counterparts. These IRIs should be normalized by decoding any | counterparts. These IRIs should be normalized by decoding any | |||
percent-encoded octet sequence that corresponds to an unreserved | percent-encoded octet sequence that corresponds to an unreserved | |||
character, as described in section 2.3 of [RFC3986]. | character, as described in section 2.3 of [RFC3986]. | |||
For actual resolution, differences in percent-encoding (except for | For actual resolution, differences in percent-encoding (except for | |||
the percent-encoding of reserved characters) MUST always result in | the percent-encoding of reserved characters) MUST always result in | |||
the same resource. For example, "http://example.org/~user", | the same resource. For example, "http://example.org/~user", | |||
"http://example.org/%7euser", and "http://example.org/%7Euser", must | "http://example.org/%7euser", and "http://example.org/%7Euser", must | |||
resolve to the same resource. | resolve to the same resource. | |||
If this kind of equivalence is to be tested, the percent-encoding of | If this kind of equivalence is to be tested, the percent-encoding of | |||
both IRIs to be compared has to be aligned; for example, by | both IRIs to be compared has to be aligned; for example, by | |||
converting both IRIs to URIs (see Section 3.1), eliminating escape | converting both IRIs to URIs, eliminating escape differences in the | |||
differences in the resulting URIs, and making sure that the case of | resulting URIs, and making sure that the case of the hexadecimal | |||
the hexadecimal characters in the percent-encoding is always the same | characters in the percent-encoding is always the same (preferably | |||
(preferably upper case). If the IRI is to be passed to another | upper case). If the IRI is to be passed to another application or | |||
application or used further in some other way, its original form MUST | used further in some other way, its original form MUST be preserved. | |||
be preserved. The conversion described here should be performed only | The conversion described here should be performed only for local | |||
for local comparison. | comparison. | |||
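
As an illustration only, the alignment step described above might be sketched in Python as follows. The function name normalize_percent_encoding is hypothetical, and the sketch assumes its input has already been converted from an IRI to a URI. It decodes percent-encoded unreserved characters (Section 2.3 of [RFC3986]) and upper-cases the hexadecimal digits of every other escape; the result is intended for local comparison only.

```python
import re

# Unreserved characters per RFC 3986, section 2.3.
UNRESERVED = frozenset(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789-._~"
)

# One percent-encoded octet: "%" followed by two hexadecimal digits.
_PCT = re.compile(r"%([0-9A-Fa-f]{2})")


def normalize_percent_encoding(uri: str) -> str:
    """Return a comparison form of `uri` (hypothetical helper).

    Escapes of unreserved characters are decoded; all other escapes are
    kept, with their hexadecimal digits converted to upper case.
    """
    def fix(match: re.Match) -> str:
        char = chr(int(match.group(1), 16))
        if char in UNRESERVED:
            return char
        return "%" + match.group(1).upper()

    # A single pass over the string, so "%257e" stays "%257e" and is
    # never decoded twice.
    return _PCT.sub(fix, uri)


forms = [
    "http://example.org/~user",
    "http://example.org/%7euser",
    "http://example.org/%7Euser",
]
# All three spellings collapse to one comparison form.
print({normalize_percent_encoding(f) for f in forms})
```

The caller keeps the original string for any further use and compares only the values returned by the helper.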
4.2.4. Path Segment Normalization | 5.2.4. Path Segment Equivalence | |||
The complete path segments "." and ".." are intended only for use | The complete path segments "." and ".." are intended only for use | |||
within relative references (Section 4.1 of [RFC3986]) and are removed | within relative references (Section 4.1 of [RFC3986]) and are removed | |||
as part of the reference resolution process (Section 5.2 of | as part of the reference resolution process (Section 5.2 of | |||
[RFC3986]). However, some implementations may incorrectly assume | [RFC3986]). However, some implementations may incorrectly assume | |||
that reference resolution is not necessary when the reference is | that reference resolution is not necessary when the reference is | |||
already an IRI, and thus fail to remove dot-segments when they occur | already an IRI, and thus fail to remove dot-segments when they occur | |||
in non-relative paths. IRI normalizers should remove dot-segments by | in non-relative paths. IRI normalizers should remove dot-segments by | |||
applying the remove_dot_segments algorithm to the path, as described | applying the remove_dot_segments algorithm to the path, as described | |||
in Section 5.2.4 of [RFC3986]. | in Section 5.2.4 of [RFC3986]. | |||
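
The following Python sketch follows the steps of the remove_dot_segments algorithm in Section 5.2.4 of [RFC3986]. It is a minimal illustration, not a reference implementation, and it operates on the path component only; splitting the IRI into components is assumed to have happened elsewhere.

```python
def remove_dot_segments(path: str) -> str:
    """Remove "." and ".." segments from a path (RFC 3986, 5.2.4)."""
    output = []  # completed segments, each with its leading "/" if any
    while path:
        if path.startswith("../"):
            path = path[3:]
        elif path.startswith("./"):
            path = path[2:]
        elif path.startswith("/./"):
            path = path[2:]          # "/./x" becomes "/x"
        elif path == "/.":
            path = "/"
        elif path.startswith("/../"):
            path = path[3:]          # "/../x" becomes "/x"
            if output:
                output.pop()         # drop the previous segment
        elif path == "/..":
            path = "/"
            if output:
                output.pop()
        elif path in (".", ".."):
            path = ""
        else:
            # Move the first segment (with its leading "/", if present)
            # from the input to the output.
            start = 1 if path.startswith("/") else 0
            end = path.find("/", start)
            if end == -1:
                end = len(path)
            output.append(path[:end])
            path = path[end:]
    return "".join(output)


print(remove_dot_segments("/a/b/c/./../../g"))
print(remove_dot_segments("mid/content=5/../6"))
```

The two inputs printed at the end are the worked examples given in Section 5.2.4 of [RFC3986].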
4.3. Scheme-Based Normalization | 5.3. Scheme-Based Comparison | |||
The syntax and semantics of IRIs vary from scheme to scheme, as | The syntax and semantics of IRIs vary from scheme to scheme, as | |||
described by the defining specification for each scheme. | described by the defining specification for each scheme. | |||
Implementations may use scheme-specific rules, at further processing | Implementations may use scheme-specific rules, at further processing | |||
cost, to reduce the probability of false negatives. For example, | cost, to reduce the probability of false negatives. For example, | |||
because the "http" scheme makes use of an authority component, has a | because the "http" scheme makes use of an authority component, has a | |||
default port of "80", and defines an empty path to be equivalent to | default port of "80", and defines an empty path to be equivalent to | |||
"/", the following four IRIs are equivalent: | "/", the following four IRIs are equivalent: | |||
http://example.com | http://example.com | |||
skipping to change at page 12, line 9 | skipping to change at page 11, line 45 | |||
be resolved. For legibility purposes, they SHOULD NOT be converted | be resolved. For legibility purposes, they SHOULD NOT be converted | |||
into ASCII Compatible Encoding (ACE). | into ASCII Compatible Encoding (ACE). | |||
Scheme-based normalization may also consider IDN components and their | Scheme-based normalization may also consider IDN components and their | |||
conversions to punycode as equivalent. As an example, | conversions to punycode as equivalent. As an example, | |||
"http://résumé.example.org" may be considered equivalent to | "http://résumé.example.org" may be considered equivalent to | |||
"http://xn--rsum-bpad.example.org". | "http://xn--rsum-bpad.example.org". | |||
Other scheme-specific normalizations are possible. | Other scheme-specific normalizations are possible. | |||
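
A scheme-specific comparison for "http" could be sketched in Python as shown below. The function name http_comparison_key is hypothetical. The sketch assumes that Python's standard "idna" codec, which implements the IDNA procedure of [RFC3490], is an acceptable way to obtain the ACE form of a host name. The key it returns is meant for comparison only; the original IRI is left untouched for display and further use.

```python
from urllib.parse import urlsplit, urlunsplit

HTTP_DEFAULT_PORT = 80


def http_comparison_key(iri: str) -> str:
    """Return a comparison key for an "http" IRI (hypothetical helper).

    Other schemes are returned unchanged, because their rules differ.
    """
    parts = urlsplit(iri)
    if parts.scheme != "http":       # urlsplit lower-cases the scheme
        return iri

    host = parts.hostname or ""      # already lower-cased by urlsplit
    if ":" in host:
        host = "[" + host + "]"      # IPv6 literal: restore the brackets
    else:
        # Map an internationalized host name to its ACE form so that
        # both spellings yield the same key.
        host = host.encode("idna").decode("ascii")

    netloc = host
    userinfo, at_sign, _ = parts.netloc.rpartition("@")
    if at_sign:
        netloc = userinfo + "@" + netloc
    if parts.port is not None and parts.port != HTTP_DEFAULT_PORT:
        netloc += ":" + str(parts.port)

    path = parts.path or "/"         # an empty path is equivalent to "/"
    return urlunsplit(("http", netloc, path, parts.query, parts.fragment))


print(http_comparison_key("http://example.com")
      == http_comparison_key("http://example.com:80/"))
print(http_comparison_key("http://r\u00e9sum\u00e9.example.org")
      == http_comparison_key("http://xn--rsum-bpad.example.org"))
```

Returning a separate key, rather than rewriting the IRI in place, keeps the legible internationalized form available to the rest of the application.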
4.4. Protocol-Based Normalization | 5.4. Protocol-Based Comparison | |||
Substantial effort to reduce the incidence of false negatives is | Substantial effort to reduce the incidence of false negatives is | |||
often cost-effective for web spiders. Consequently, they implement | often cost-effective for web spiders. Consequently, they implement | |||
even more aggressive techniques in IRI comparison. For example, if | even more aggressive techniques in IRI comparison. For example, if | |||
they observe that an IRI such as | they observe that an IRI such as | |||
http://example.com/data | http://example.com/data | |||
redirects to an IRI differing only in the trailing slash | redirects to an IRI differing only in the trailing slash | |||
http://example.com/data/ | http://example.com/data/ | |||
they will likely regard the two as equivalent in the future. This | they will likely regard the two as equivalent in the future. This | |||
kind of technique is only appropriate when equivalence is clearly | kind of technique is only appropriate when equivalence is clearly | |||
indicated by both the result of accessing the resources and the | indicated by both the result of accessing the resources and the | |||
common conventions of their scheme's dereference algorithm (in this | common conventions of their scheme's dereference algorithm (in this | |||
case, use of redirection by HTTP origin servers to avoid problems | case, use of redirection by HTTP origin servers to avoid problems | |||
with relative references). | with relative references). | |||
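
One way a crawler might record such observations is sketched below in Python. The class RedirectAliases and its methods are hypothetical, no network access is performed, and the only case handled is the one described above: a redirect whose target differs from its source by a single trailing slash.

```python
class RedirectAliases:
    """Remember observed trailing-slash redirects (hypothetical sketch)."""

    def __init__(self) -> None:
        self._alias: dict[str, str] = {}

    def record_redirect(self, source: str, target: str) -> bool:
        """Record a redirect seen while crawling.

        Returns True only if the target is the source plus a trailing
        slash; any other redirect is ignored by this sketch.
        """
        if target == source + "/":
            self._alias[source] = target
            return True
        return False

    def canonical(self, iri: str) -> str:
        """Return the redirect target if one was recorded, else the IRI."""
        return self._alias.get(iri, iri)

    def equivalent(self, a: str, b: str) -> bool:
        return self.canonical(a) == self.canonical(b)


aliases = RedirectAliases()
aliases.record_redirect("http://example.com/data", "http://example.com/data/")
print(aliases.equivalent("http://example.com/data",
                         "http://example.com/data/"))
```

IRIs for which no redirect has been observed are never treated as equivalent by this sketch, which keeps the technique limited to cases where equivalence was actually indicated by the server.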
5. Security Considerations | 6. Security Considerations | |||
The primary security difficulty comes from applications choosing the | The primary security difficulty comes from applications choosing the | |||
wrong equivalence relationship, or two different parties disagreeing | wrong equivalence relationship, or two different parties disagreeing | |||
on equivalence. This is especially a problem when IRIs are used in | on equivalence. This is especially a problem when IRIs are used in | |||
security protocols. | security protocols. | |||
Besides the large character repertoire of Unicode, reasons for | Besides the large character repertoire of Unicode, reasons for | |||
confusion include different forms of normalization and different | confusion include different forms of normalization and different | |||
normalization expectations, use of percent-encoding with various | normalization expectations, use of percent-encoding with various | |||
legacy encodings, and bidirectionality issues. See also [UTR36]. | legacy encodings, and bidirectionality issues. See also [UTR36]. | |||
6. Acknowledgements | 7. Acknowledgements | |||
This document was originally derived from [RFC3986] and [RFC3987], | This document was originally derived from [RFC3986] and [RFC3987], | |||
based on text contributed by Tim Bray. | based on text contributed by Tim Bray. | |||
7. References | 8. References | |||
7.1. Normative References | 8.1. Normative References | |||
[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate | [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate | |||
Requirement Levels", BCP 14, RFC 2119, March 1997. | Requirement Levels", BCP 14, RFC 2119, March 1997. | |||
[RFC3490] Faltstrom, P., Hoffman, P., and A. Costello, | [RFC3490] Faltstrom, P., Hoffman, P., and A. Costello, | |||
"Internationalizing Domain Names in Applications (IDNA)", | "Internationalizing Domain Names in Applications (IDNA)", | |||
RFC 3490, March 2003. | RFC 3490, March 2003. | |||
[RFC3491] Hoffman, P. and M. Blanchet, "Nameprep: A Stringprep | [RFC3491] Hoffman, P. and M. Blanchet, "Nameprep: A Stringprep | |||
Profile for Internationalized Domain Names (IDN)", | Profile for Internationalized Domain Names (IDN)", | |||
skipping to change at page 13, line 42 | skipping to change at page 13, line 30 | |||
[UNIV6] The Unicode Consortium, "The Unicode Standard, Version | [UNIV6] The Unicode Consortium, "The Unicode Standard, Version | |||
6.0.0 (Mountain View, CA, The Unicode Consortium, 2011, | 6.0.0 (Mountain View, CA, The Unicode Consortium, 2011, | |||
ISBN 978-1-936213-01-6)", October 2010. | ISBN 978-1-936213-01-6)", October 2010. | |||
[UTR15] Davis, M. and M. Duerst, "Unicode Normalization Forms", | [UTR15] Davis, M. and M. Duerst, "Unicode Normalization Forms", | |||
Unicode Standard Annex #15, March 2008, | Unicode Standard Annex #15, March 2008, | |||
<http://www.unicode.org/unicode/reports/tr15/ | <http://www.unicode.org/unicode/reports/tr15/ | |||
tr15-23.html>. | tr15-23.html>. | |||
7.2. Informative References | 8.2. Informative References | |||
[HTML4] Raggett, D., Le Hors, A., and I. Jacobs, "HTML 4.01 | [HTML4] Raggett, D., Le Hors, A., and I. Jacobs, "HTML 4.01 | |||
Specification", World Wide Web Consortium Recommendation, | Specification", World Wide Web Consortium Recommendation, | |||
December 1999, | December 1999, | |||
<http://www.w3.org/TR/html401/appendix/notes.html#h-B.2>. | <http://www.w3.org/TR/html401/appendix/notes.html#h-B.2>. | |||
[RFC2045] Freed, N. and N. Borenstein, "Multipurpose Internet Mail | [RFC2045] Freed, N. and N. Borenstein, "Multipurpose Internet Mail | |||
Extensions (MIME) Part One: Format of Internet Message | Extensions (MIME) Part One: Format of Internet Message | |||
Bodies", RFC 2045, November 1996. | Bodies", RFC 2045, November 1996. | |||
End of changes. 30 change blocks. | ||||
89 lines changed or deleted | 81 lines changed or added | |||