draft-ietf-iri-comparison-00.txt   draft-ietf-iri-comparison-01.txt 
Internationalized Resource Identifiers L. Masinter Internationalized Resource Identifiers L. Masinter
(iri) Adobe (iri) Adobe
Internet-Draft M. Duerst Internet-Draft M. Duerst
Intended status: Standards Track Aoyama Gakuin University Intended status: Standards Track Aoyama Gakuin University
Expires: February 15, 2012 August 14, 2011 Expires: September 3, 2012 March 2, 2012
Equivalence and Canonicalization of Internationalized Resource Equivalence and Canonicalization of Internationalized Resource
Identifiers (IRIs) Identifiers (IRIs)
draft-ietf-iri-comparison-00 draft-ietf-iri-comparison-01
Abstract Abstract
Internationalized Resource Identifiers (IRIs) are Unicode strings Internationalized Resource Identifiers (IRIs) are Unicode strings
used to identify resources on the Internet. Applications that use used to identify resources on the Internet. Applications that use
IRIs often define a means of comparing two IRIs to determine when two IRIs often define a means of comparing two IRIs to determine when two
IRIs are equivalent for the purpose of that application. Some IRIs are equivalent for the purpose of that application. Some
applications also define a method for 'canonicalizing' or applications also define a method for 'canonicalizing' or
'normalizing' an IRI -- translating one IRI into another which is 'normalizing' an IRI -- translating one IRI into another which is
equivalent under the comparison method used. equivalent under the comparison method used.
skipping to change at page 1, line 42 skipping to change at page 1, line 42
Internet-Drafts are working documents of the Internet Engineering Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF). Note that other groups may also distribute Task Force (IETF). Note that other groups may also distribute
working documents as Internet-Drafts. The list of current Internet- working documents as Internet-Drafts. The list of current Internet-
Drafts is at http://datatracker.ietf.org/drafts/current/. Drafts is at http://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress." material or to cite them other than as "work in progress."
This Internet-Draft will expire on February 15, 2012. This Internet-Draft will expire on September 3, 2012.
Copyright Notice Copyright Notice
Copyright (c) 2011 IETF Trust and the persons identified as the Copyright (c) 2012 IETF Trust and the persons identified as the
document authors. All rights reserved. document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents Provisions Relating to IETF Documents
(http://trustee.ietf.org/license-info) in effect on the date of (http://trustee.ietf.org/license-info) in effect on the date of
publication of this document. Please review these documents publication of this document. Please review these documents
carefully, as they describe your rights and restrictions with respect carefully, as they describe your rights and restrictions with respect
to this document. Code Components extracted from this document must to this document. Code Components extracted from this document must
include Simplified BSD License text as described in Section 4.e of include Simplified BSD License text as described in Section 4.e of
the Trust Legal Provisions and are provided without warranty as the Trust Legal Provisions and are provided without warranty as
skipping to change at page 3, line 8 skipping to change at page 3, line 8
Without obtaining an adequate license from the person(s) controlling Without obtaining an adequate license from the person(s) controlling
the copyright in such materials, this document may not be modified the copyright in such materials, this document may not be modified
outside the IETF Standards Process, and derivative works of it may outside the IETF Standards Process, and derivative works of it may
not be created outside the IETF Standards Process, except to format not be created outside the IETF Standards Process, except to format
it for publication as an RFC or to translate it into languages other it for publication as an RFC or to translate it into languages other
than English. than English.
Table of Contents Table of Contents
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 4 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 4
2. Equivalence . . . . . . . . . . . . . . . . . . . . . . . . . 5 2. Equivalence . . . . . . . . . . . . . . . . . . . . . . . . . 4
3. Preparation for Comparison . . . . . . . . . . . . . . . . . . 6 3. Comparison, Equivalence, Normalization and
4. Comparison Ladder . . . . . . . . . . . . . . . . . . . . . . 6 Canonicalization . . . . . . . . . . . . . . . . . . . . . . . 5
4.1. Simple String Comparison . . . . . . . . . . . . . . . . . 7 4. Preparation for Comparison . . . . . . . . . . . . . . . . . . 5
4.2. Syntax-Based Normalization . . . . . . . . . . . . . . . . 8 5. Comparison Ladder . . . . . . . . . . . . . . . . . . . . . . 6
4.2.1. Case Normalization . . . . . . . . . . . . . . . . . . 8 5.1. Simple String Comparison . . . . . . . . . . . . . . . . . 6
4.2.2. Character Normalization . . . . . . . . . . . . . . . 8 5.2. Syntax-Based Equivalence . . . . . . . . . . . . . . . . . 7
4.2.3. Percent-Encoding Normalization . . . . . . . . . . . . 10 5.2.1. Case Equivalence . . . . . . . . . . . . . . . . . . . 8
4.2.4. Path Segment Normalization . . . . . . . . . . . . . . 10 5.2.2. Unicode Character Normalization . . . . . . . . . . . 8
4.3. Scheme-Based Normalization . . . . . . . . . . . . . . . . 10 5.2.3. Percent-Encoding Equivalence . . . . . . . . . . . . . 9
4.4. Protocol-Based Normalization . . . . . . . . . . . . . . . 12 5.2.4. Path Segment Equivalence . . . . . . . . . . . . . . . 10
5. Security Considerations . . . . . . . . . . . . . . . . . . . 12 5.3. Scheme-Based Comparison . . . . . . . . . . . . . . . . . 10
6. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 12 5.4. Protocol-Based Comparison . . . . . . . . . . . . . . . . 11
7. References . . . . . . . . . . . . . . . . . . . . . . . . . . 13 6. Security Considerations . . . . . . . . . . . . . . . . . . . 12
7.1. Normative References . . . . . . . . . . . . . . . . . . . 13 7. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 12
7.2. Informative References . . . . . . . . . . . . . . . . . . 13 8. References . . . . . . . . . . . . . . . . . . . . . . . . . . 12
8.1. Normative References . . . . . . . . . . . . . . . . . . . 12
8.2. Informative References . . . . . . . . . . . . . . . . . . 13
Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 14 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 14
1. Introduction 1. Introduction
Internationalized Resource Identifiers (IRIs) are Unicode strings Internationalized Resource Identifiers (IRIs) are Unicode strings
used to identify resources on the Internet. Applications that use used to identify resources on the Internet. Applications that use
IRIs often define a means of comparing two IRIs to determine when two IRIs often define a means of comparing two IRIs to determine when two
IRIs are equivalent for the purpose of that application. Some IRIs are equivalent for the purpose of that application. Some
applications also define a method for 'canonicalizing' or applications also define a method for 'canonicalizing' or
'normalizing' an IRI -- translating one IRI into another which is 'normalizing' an IRI -- translating one IRI into another which is
equivalent under the comparison method used. equivalent under the comparison method used.
This document gives guidelines and best practices for defining and This document gives guidelines and best practices for defining and
using IRI comparison, equivalence, normalization and canonicalization using IRI comparison, equivalence, normalization and canonicalization
methods. methods.
Things to do:
o Introductory section on comparison, equivalence, normalization and
canonicalization.
o Verify acknowledgements for this component.
o Verify cross-references from other documents.
o Consider making 4395bis reference this document and recommend
scheme definitions describe equivalence specifically.
o Consider making this document 'update' 3986 in order to resolve
which one is normative if there are conflicts.
o alternatively? Consider making this document BCP rather than
standards track, since it basically gives guidance for protocols
and applications needing equivalence, and doesn't directly have a
scope of application?
o Distinguish between IRIs as sequence-of-unicode characters and
presentations of IRIs.
o Should we insist that percent-hex encoding equivalence of non-
reserved characters MUST be always used if there is any
equivalence at all?
o Update security considerations to describe security concerns
specific to comparison.
o Consider making sections talk about 'equivalent' rather than
'normalization' where appropriate.
One of the most common operations on IRIs is simple comparison: One of the most common operations on IRIs is simple comparison:
Determining whether two IRIs are equivalent, without using the IRIs Determining whether two IRIs are equivalent, without using the IRIs
to access their respective resource(s). A comparison is performed to access their respective resource(s). A comparison is performed
whenever a response cache is accessed, a browser checks its history whenever a response cache is accessed, a browser checks its history
to color a link, or an XML parser processes tags within a namespace. to color a link, or an XML parser processes tags within a namespace.
Extensive normalization prior to comparison of IRIs may be used by Extensive normalization prior to comparison of IRIs may be used by
spiders and indexing engines to prune a search space or reduce spiders and indexing engines to prune a search space or reduce
duplication of request actions and response storage. duplication of request actions and response storage.
IRI comparison is performed for some particular purpose. Protocols IRI comparison is performed for some particular purpose. Protocols
or implementations that compare IRIs for different purposes will or implementations that compare IRIs for different purposes will
skipping to change at page 5, line 29 skipping to change at page 4, line 44
use them. use them.
2. Equivalence 2. Equivalence
Because IRIs exist to identify resources, presumably they should be Because IRIs exist to identify resources, presumably they should be
considered equivalent when they identify the same resource. However, considered equivalent when they identify the same resource. However,
this definition of equivalence is not of much practical use, as there this definition of equivalence is not of much practical use, as there
is no way for an implementation to compare two resources to determine is no way for an implementation to compare two resources to determine
if they are "the same" unless it has full knowledge or control of if they are "the same" unless it has full knowledge or control of
them. For this reason, determination of equivalence or difference of them. For this reason, determination of equivalence or difference of
IRIs is based on string comparison, perhaps augmented by reference to IRIs is based on string comparison, augmented by reference to
additional rules provided by URI scheme definitions. We use the additional rules provided by scheme definition. We use the terms
terms "different" and "equivalent" to describe the possible outcomes "different" and "equivalent" to describe the possible outcomes of
of such comparisons, but there are many application-dependent such comparisons, but there are many application-dependent versions
versions of equivalence. of equivalence.
Even when it is possible to determine that two IRIs are equivalent, Even when it is possible to determine that two IRIs are equivalent,
IRI comparison is not sufficient to determine whether two IRIs IRI comparison is not sufficient to determine whether two IRIs
identify different resources. For example, an owner of two different identify different resources. For example, an owner of two different
domain names could decide to serve the same resource from both, domain names could decide to serve the same resource from both,
resulting in two different IRIs. Therefore, comparison methods are resulting in two different IRIs. Therefore, comparison methods are
designed to minimize false negatives while strictly avoiding false designed to minimize false negatives while strictly avoiding false
positives. positives.
In testing for equivalence, applications should not directly compare In testing for equivalence, applications should not directly compare
relative references; the references should be converted to their relative references; the references should be converted to their
respective target IRIs before comparison. When IRIs are compared to respective target IRIs before comparison. When IRIs are compared to
select (or avoid) a network action, such as retrieval of a select (or avoid) a network action, such as retrieval of a
representation, fragment components (if any) MUST be excluded from representation, fragment components (if any) MUST be excluded from
the comparison. the comparison.
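The two rules in the paragraph above (resolve relative references first, then exclude fragments when the comparison selects a network action) can be sketched as follows. This is a minimal illustration using Python's standard library, not part of either draft revision; the helper name retrieval_target and the example.org references are assumptions made for the example.

```python
from urllib.parse import urldefrag, urljoin


def retrieval_target(base: str, reference: str) -> str:
    """Resolve a reference against a base, then drop any fragment."""
    target = urljoin(base, reference)   # relative reference -> target IRI
    return urldefrag(target).url        # fragment excluded for retrieval


# Hypothetical base and references, chosen only for illustration.
base = "http://example.org/a/b"
left = retrieval_target(base, "c#intro")
right = retrieval_target(base, "http://example.org/a/c#summary")
print(left == right)
```

Comparing the raw references "c#intro" and "http://example.org/a/c#summary" directly would report a difference even though both select the same retrieval.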
Applications using IRIs as identity tokens with no relationship to a Applications using IRIs as identity tokens with no relationship to a
protocol MUST use the Simple String Comparison (see Section 4.1). protocol MUST use the Simple String Comparison (see Section 5.1).
All other applications MUST select one of the comparison practices All other applications MUST select one of the comparison practices
from the Comparison Ladder (see Section 4). from the Comparison Ladder (see Section 5).
3. Preparation for Comparison 3. Comparison, Equivalence, Normalization and Canonicalization
In general, when considering a set of items or strings, there are
several interrelated concepts. A comparison method determines the
relationship between two items in the set. In particular, a
comparison method for determining equivalence might find that two
(different) items are equivalent, that they are known to be
different, or that their equivalence cannot be determined.
One way to define a comparison for equivalence is to define a
normalization or canonicalization algorithm. Within each set of
equivalent items, one item can be designated the "normal" or
"canonical" form; two items are then equivalent when the algorithm
maps them to the same form.
These general concepts are used with IRIs in this document, and in
other circumstances, where a mapping from one sequence of Unicode
characters to another one could be described as a "normalization"
algorithm.
In general, this document prefers the terms "equivalence" and
"comparison", because sometimes the mathematical notion of
"normalization" results in forms that ordinary users might not
consider "normal" in an ordinary sense.
4. Preparation for Comparison
Any kind of IRI comparison REQUIRES that any additional contextual Any kind of IRI comparison REQUIRES that any additional contextual
processing is first performed, including undoing higher-level processing is first performed, including undoing higher-level
escapings or encodings in the protocol or format that carries an IRI. escapings or encodings in the protocol or format that carries an IRI.
This preprocessing is usually done when the protocol or format is This preprocessing is usually done when the protocol or format is
parsed. parsed.
Examples of such escapings or encodings are entities and numeric Examples of such escapings or encodings are entities and numeric
character references in [HTML4] and [XML1]. As an example, character references in [HTML4] and [XML1]. As an example,
"http://example.org/rosé" (in HTML), "http://example.org/rosé" (in HTML),
skipping to change at page 6, line 31 skipping to change at page 6, line 23
what is denoted in this document (see 'Notation' section of what is denoted in this document (see 'Notation' section of
[RFC3987bis]) as "http://example.org/ros&#xE9;" (the "&#xE9;" here [RFC3987bis]) as "http://example.org/ros&#xE9;" (the "&#xE9;" here
standing for the actual e-acute character, to compensate for the fact standing for the actual e-acute character, to compensate for the fact
that this document cannot contain non-ASCII characters). that this document cannot contain non-ASCII characters).
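As a rough illustration of undoing format-level escapes before comparison, the sketch below decodes HTML character references with Python's html.unescape. The three markup spellings are assumptions chosen for the example (one named entity, one decimal and one hexadecimal numeric reference); they are not quoted from the draft.

```python
import html

# Three hypothetical spellings of the same IRI as they might appear in markup.
candidates = [
    "http://example.org/ros&eacute;",
    "http://example.org/ros&#233;",
    "http://example.org/ros&#xE9;",
]

# Undo the format-level escaping; only then are the IRIs comparable.
decoded = {html.unescape(c) for c in candidates}
print(len(decoded))
```

After decoding, all three spellings collapse to a single character string ending in U+00E9.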
Similar considerations apply to encodings such as Transfer Codings in Similar considerations apply to encodings such as Transfer Codings in
HTTP (see [RFC2616]) and Content Transfer Encodings in MIME HTTP (see [RFC2616]) and Content Transfer Encodings in MIME
([RFC2045]), although in these cases, the encoding is based not on ([RFC2045]), although in these cases, the encoding is based not on
characters but on octets, and additional care is required to make characters but on octets, and additional care is required to make
sure that characters, and not just arbitrary octets, are compared sure that characters, and not just arbitrary octets, are compared
(see Section 4.1). (see Section 5.1).
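The point about comparing characters rather than octets can be illustrated with a small sketch. It assumes the same IRI arrives once as quoted-printable UTF-8 and once as base64 Latin-1; the payloads are hypothetical, and the charsets would normally come from the carrying protocol's headers.

```python
import base64
import quopri

# Hypothetical payloads carrying the same IRI under different encodings.
qp_utf8 = b"http://example.org/ros=C3=A9"
b64_latin1 = base64.b64encode("http://example.org/ros\u00e9".encode("latin-1"))

# Step 1: undo the transfer encoding (yields octets).
left_octets = quopri.decodestring(qp_utf8)
right_octets = base64.b64decode(b64_latin1)

# Step 2: decode octets to characters using the declared charset.
left = left_octets.decode("utf-8")
right = right_octets.decode("latin-1")

print(left_octets == right_octets, left == right)
```

The octet sequences differ, while the character strings are identical, which is why the comparison has to happen after both decoding steps.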
4. Comparison Ladder 5. Comparison Ladder
In practice, a variety of methods are used to test IRI equivalence. In practice, a variety of methods are used to test IRI equivalence.
These methods fall into a range distinguished by the amount of These methods fall into a range distinguished by the amount of
processing required and the degree to which the probability of false processing required and the degree to which the probability of false
negatives is reduced. As noted above, false negatives cannot be negatives is reduced. As noted above, false negatives cannot be
eliminated. In practice, their probability can be reduced, but this eliminated. In practice, their probability can be reduced, but this
reduction requires more processing and is not cost-effective for all reduction requires more processing and is not cost-effective for all
applications. applications.
If this range of comparison practices is considered as a ladder, the If this range of comparison practices is considered as a ladder, the
following discussion will climb the ladder, starting with practices following discussion will climb the ladder, starting with practices
that are cheap but have a relatively higher chance of producing false that are cheap but have a relatively higher chance of producing false
negatives, and proceeding to those that have higher computational negatives, and proceeding to those that have higher computational
cost and lower risk of false negatives. cost and lower risk of false negatives.
4.1. Simple String Comparison 5.1. Simple String Comparison
If two IRIs, when considered as character strings, are identical, If two IRIs, when considered as character strings, are identical,
then it is safe to conclude that they are equivalent. This type of then it is safe to conclude that they are equivalent. This type of
equivalence test has very low computational cost and is in wide use equivalence test has very low computational cost and is in wide use
in a variety of applications, particularly in the domain of parsing. in a variety of applications, particularly in the domain of parsing.
It is also used when a definitive answer to the question of IRI It is also used when a definitive answer to the question of IRI
equivalence is needed that is independent of the scheme used and that equivalence is needed that is independent of the scheme used and that
can be calculated quickly and without accessing a network. An can be calculated quickly and without accessing a network. An
example of such a case is XML Namespaces ([XMLNamespace]). example of such a case is XML Namespaces ([XMLNamespace]).
skipping to change at page 8, line 5 skipping to change at page 7, line 40
Unnecessary aliases can be reduced, regardless of the comparison Unnecessary aliases can be reduced, regardless of the comparison
method, by consistently providing IRI references in an already method, by consistently providing IRI references in an already
normalized form (i.e., a form identical to what would be produced normalized form (i.e., a form identical to what would be produced
after normalization is applied, as described below). Protocols and after normalization is applied, as described below). Protocols and
data formats often limit some IRI comparisons to simple string data formats often limit some IRI comparisons to simple string
comparison, based on the theory that people and implementations will, comparison, based on the theory that people and implementations will,
in their own best interest, be consistent in providing IRI in their own best interest, be consistent in providing IRI
references, or at least be consistent enough to negate any efficiency references, or at least be consistent enough to negate any efficiency
that might be obtained from further normalization. that might be obtained from further normalization.
4.2. Syntax-Based Normalization 5.2. Syntax-Based Equivalence
Implementations may use logic based on the definitions provided by Implementations may use logic based on the definitions provided by
this specification to reduce the probability of false negatives. this specification to reduce the probability of false negatives.
This processing is moderately higher in cost than character-for- This processing is moderately higher in cost than character-for-
character string comparison. For example, an application using this character string comparison. For example, an application using this
approach could reasonably consider the following two IRIs equivalent: approach could reasonably consider the following two IRIs equivalent:
example://a/b/c/%7Bfoo%7D/rosé example://a/b/c/%7Bfoo%7D/rosé
eXAMPLE://a/./b/../b/%63/%7bfoo%7d/ros%C3%A9 eXAMPLE://a/./b/../b/%63/%7bfoo%7d/ros%C3%A9
Web user agents, such as browsers, typically apply this type of IRI Web user agents, such as browsers, typically apply this type of IRI
normalization when determining whether a cached response is normalization when determining whether a cached response is
available. Syntax-based normalization includes such techniques as available. Syntax-based normalization includes such techniques as
case normalization, character normalization, percent-encoding case normalization, character normalization, percent-encoding
normalization, and removal of dot-segments. normalization, and removal of dot-segments.
4.2.1. Case Normalization 5.2.1. Case Equivalence
For all IRIs, the hexadecimal digits within a percent-encoding For all IRIs, the hexadecimal digits within a percent-encoding
triplet (e.g., "%3a" versus "%3A") are case-insensitive and therefore triplet (e.g., "%3a" versus "%3A") are case-insensitive and therefore
should be normalized to use uppercase letters for the digits A-F. should be considered equivalent to forms which use uppercase letters
for the digits A-F.
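A minimal sketch of treating the hexadecimal digits of percent-encoding triplets as case-insensitive, by mapping them to uppercase for local comparison. The function name upper_pct is an assumption for illustration.

```python
import re

# Matches one percent-encoding triplet such as "%3a" or "%3A".
_PCT = re.compile(r"%[0-9A-Fa-f]{2}")


def upper_pct(iri: str) -> str:
    """Uppercase the hex digits of every percent-encoding triplet."""
    return _PCT.sub(lambda m: m.group(0).upper(), iri)


print(upper_pct("http://example.org/%7euser%3a") == upper_pct("http://example.org/%7Euser%3A"))
```

Only the triplets are touched; other letters in the IRI keep their case.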
When an IRI uses components of the generic syntax, the component When an IRI uses components of the generic syntax, the component
syntax equivalence rules always apply; namely, that the scheme and syntax equivalence rules always apply; namely, that the scheme and
US-ASCII-only host are case insensitive and therefore should be US-ASCII-only host are case insensitive and therefore should be
normalized to lowercase. For example, the URI normalized to lowercase. For example, the URI
"HTTP://www.EXAMPLE.com/" is equivalent to "http://www.example.com/". "HTTP://www.EXAMPLE.com/" is equivalent to "http://www.example.com/".
Case equivalence for non-ASCII characters in IRI components that are Case equivalence for non-ASCII characters in IRI components that are
IDNs is discussed in Section 4.3. The other generic syntax IDNs is discussed in Section 5.3. The other generic syntax
components are assumed to be case sensitive unless specifically components are assumed to be case sensitive unless specifically
defined otherwise by the scheme. defined otherwise by the scheme.
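The scheme and US-ASCII host rule could be sketched as below, assuming an IRI that follows the generic syntax. The helper name is hypothetical, and non-ASCII hosts are deliberately left untouched because their case handling belongs to the IDN discussion.

```python
from urllib.parse import urlsplit, urlunsplit


def lower_scheme_and_host(iri: str) -> str:
    """Lowercase the scheme and an ASCII-only host; leave the rest as-is."""
    parts = urlsplit(iri)
    userinfo, at, hostport = parts.netloc.rpartition("@")
    if hostport.isascii():          # non-ASCII hosts are handled elsewhere
        hostport = hostport.lower()
    return urlunsplit((parts.scheme.lower(), userinfo + at + hostport,
                       parts.path, parts.query, parts.fragment))


print(lower_scheme_and_host("HTTP://www.EXAMPLE.com/"))
```

The path, query and fragment are passed through unchanged, reflecting the assumption that they are case sensitive unless the scheme says otherwise.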
Creating schemes that allow case-insensitive syntax components Creating schemes that allow case-insensitive syntax components
containing non-ASCII characters should be avoided. Case containing non-ASCII characters should be avoided. Case
normalization of non-ASCII characters can be culturally dependent and normalization of non-ASCII characters can be culturally dependent and
is always a complex operation. The only exception concerns non-ASCII is always a complex operation. The only exception concerns non-ASCII
host names for which the character normalization includes a mapping host names for which the character normalization includes a mapping
step derived from case folding. step derived from case folding.
4.2.2. Character Normalization 5.2.2. Unicode Character Normalization
The Unicode Standard [UNIV6] defines various equivalences between The Unicode Standard [UNIV6] defines various equivalences between
sequences of characters for various purposes. Unicode Standard Annex sequences of characters for various purposes. Unicode Standard Annex
#15 [UTR15] defines various Normalization Forms for these #15 [UTR15] defines various Normalization Forms for these
equivalences, in particular Normalization Form C (NFC, Canonical equivalences, in particular Normalization Form C (NFC, Canonical
Decomposition, followed by Canonical Composition) and Normalization Decomposition, followed by Canonical Composition) and Normalization
Form KC (NFKC, Compatibility Decomposition, followed by Canonical Form KC (NFKC, Compatibility Decomposition, followed by Canonical
Composition). Composition).
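To make the difference between NFC and NFKC concrete, the sketch below uses Python's unicodedata module on two sample strings chosen for illustration. It only demonstrates the equivalences themselves and is not a recommendation to rewrite IRIs.

```python
import unicodedata

composed = "ros\u00e9"       # e-acute as one precomposed character
decomposed = "rose\u0301"    # "e" followed by COMBINING ACUTE ACCENT
fullwidth_a = "\uff41"       # FULLWIDTH LATIN SMALL LETTER A

# Canonically equivalent sequences become identical under NFC.
nfc_equal = (unicodedata.normalize("NFC", composed)
             == unicodedata.normalize("NFC", decomposed))

# A compatibility character survives NFC but is folded by NFKC.
nfc_keeps_fullwidth = unicodedata.normalize("NFC", fullwidth_a) == fullwidth_a
nfkc_folds_fullwidth = unicodedata.normalize("NFKC", fullwidth_a) == "a"

print(composed == decomposed, nfc_equal, nfc_keeps_fullwidth, nfkc_folds_fullwidth)
```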
IRIs already in Unicode MUST NOT be normalized before parsing or IRIs already in Unicode MUST NOT be normalized before parsing or
skipping to change at page 10, line 6 skipping to change at page 9, line 43
unclear whether they are case sensitive, case insensitive, or unclear whether they are case sensitive, case insensitive, or
something in between (e.g., case sensitive, but with a multiple something in between (e.g., case sensitive, but with a multiple
choice selection if the wrong case is used, instead of a direct choice selection if the wrong case is used, instead of a direct
negative result). The best recipe is that the creator use a negative result). The best recipe is that the creator use a
reasonable capitalization and, when transferring the URI, reasonable capitalization and, when transferring the URI,
capitalization never be changed. capitalization never be changed.
Various IRI schemes may allow the usage of Internationalized Domain Various IRI schemes may allow the usage of Internationalized Domain
Names (IDN) [RFC5890] either in the ireg-name part or elsewhere. Names (IDN) [RFC5890] either in the ireg-name part or elsewhere.
Character Normalization also applies to IDNs, as discussed in Character Normalization also applies to IDNs, as discussed in
Section 4.3. Section 5.3.
4.2.3. Percent-Encoding Normalization 5.2.3. Percent-Encoding Equivalence
The percent-encoding mechanism (Section 2.1 of [RFC3986]) is a The percent-encoding mechanism (Section 2.1 of [RFC3986]) is a
frequent source of variance among otherwise identical IRIs. In frequent source of variance among otherwise identical IRIs. In
addition to the case normalization issue noted above, some IRI addition to the case normalization issue noted above, some IRI
producers percent-encode octets that do not require percent-encoding, producers percent-encode octets that do not require percent-encoding,
resulting in IRIs that are equivalent to their nonencoded resulting in IRIs that are equivalent to their nonencoded
counterparts. These IRIs should be normalized by decoding any counterparts. These IRIs should be normalized by decoding any
percent-encoded octet sequence that corresponds to an unreserved percent-encoded octet sequence that corresponds to an unreserved
character, as described in section 2.3 of [RFC3986]. character, as described in section 2.3 of [RFC3986].
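A minimal sketch of decoding percent-encoded octets that correspond to unreserved characters (ALPHA, DIGIT, "-", ".", "_", "~" per Section 2.3 of [RFC3986]). The function name is an assumption; every other triplet is left exactly as found.

```python
import re
import string

# Unreserved characters as listed in RFC 3986, Section 2.3.
_UNRESERVED = set(string.ascii_letters + string.digits + "-._~")
_PCT = re.compile(r"%([0-9A-Fa-f]{2})")


def decode_unreserved(iri: str) -> str:
    """Decode only those triplets that stand for unreserved characters."""
    def repl(match):
        ch = chr(int(match.group(1), 16))
        return ch if ch in _UNRESERVED else match.group(0)
    return _PCT.sub(repl, iri)


print(decode_unreserved("http://example.org/%7Euser/%63/%7Bfoo%7D"))
```

In the sample, "%7E" and "%63" are decoded, while "%7B" and "%7D" stay encoded because braces are not unreserved.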
For actual resolution, differences in percent-encoding (except for For actual resolution, differences in percent-encoding (except for
the percent-encoding of reserved characters) MUST always result in the percent-encoding of reserved characters) MUST always result in
the same resource. For example, "http://example.org/~user", the same resource. For example, "http://example.org/~user",
"http://example.org/%7euser", and "http://example.org/%7Euser", must "http://example.org/%7euser", and "http://example.org/%7Euser", must
resolve to the same resource. resolve to the same resource.
If this kind of equivalence is to be tested, the percent-encoding of If this kind of equivalence is to be tested, the percent-encoding of
both IRIs to be compared has to be aligned; for example, by both IRIs to be compared has to be aligned; for example, by
converting both IRIs to URIs (see Section 3.1), eliminating escape converting both IRIs to URIs, eliminating escape differences in the
differences in the resulting URIs, and making sure that the case of resulting URIs, and making sure that the case of the hexadecimal
the hexadecimal characters in the percent-encoding is always the same characters in the percent-encoding is always the same (preferably
(preferably upper case). If the IRI is to be passed to another upper case). If the IRI is to be passed to another application or
application or used further in some other way, its original form MUST used further in some other way, its original form MUST be preserved.
be preserved. The conversion described here should be performed only The conversion described here should be performed only for local
for local comparison. comparison.
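One way to align percent-encoding for local comparison while preserving the original IRI is to derive a separate comparison key. The sketch below is an assumption-laden simplification: it percent-encodes every non-ASCII character as UTF-8 (ignoring any special treatment of host names) and uppercases hex digits, and it omits the decoding of unreserved characters. The name comparison_key is hypothetical.

```python
import re
from urllib.parse import quote

_PCT = re.compile(r"%[0-9A-Fa-f]{2}")


def comparison_key(iri: str) -> str:
    """Build a key for local comparison; the IRI itself is not modified."""
    # Percent-encode non-ASCII characters as UTF-8; ASCII passes through.
    as_uri = "".join(ch if ord(ch) < 128 else quote(ch, safe="") for ch in iri)
    # Make the case of hexadecimal digits uniform (uppercase).
    return _PCT.sub(lambda m: m.group(0).upper(), as_uri)


original = "http://example.org/ros\u00e9"
other = "http://example.org/ros%c3%a9"
same = comparison_key(original) == comparison_key(other)
print(same)
```

Because the key is computed on the side, the value passed on to other applications remains the string that was originally received.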
4.2.4. Path Segment Normalization 5.2.4. Path Segment Equivalence
The complete path segments "." and ".." are intended only for use The complete path segments "." and ".." are intended only for use
within relative references (Section 4.1 of [RFC3986]) and are removed within relative references (Section 4.1 of [RFC3986]) and are removed
as part of the reference resolution process (Section 5.2 of as part of the reference resolution process (Section 5.2 of
[RFC3986]). However, some implementations may incorrectly assume [RFC3986]). However, some implementations may incorrectly assume
that reference resolution is not necessary when the reference is that reference resolution is not necessary when the reference is
already an IRI, and thus fail to remove dot-segments when they occur already an IRI, and thus fail to remove dot-segments when they occur
in non-relative paths. IRI normalizers should remove dot-segments by in non-relative paths. IRI normalizers should remove dot-segments by
applying the remove_dot_segments algorithm to the path, as described applying the remove_dot_segments algorithm to the path, as described
in Section 5.2.4 of [RFC3986]. in Section 5.2.4 of [RFC3986].
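The remove_dot_segments algorithm referenced above can be sketched in Python as follows. The step labels in the comments follow Section 5.2.4 of [RFC3986]; the code is an illustrative transcription rather than a reference implementation.

```python
def remove_dot_segments(path: str) -> str:
    """Remove "." and ".." segments from a path (RFC 3986, Section 5.2.4)."""
    output = []
    while path:
        if path.startswith("../"):                 # step A
            path = path[3:]
        elif path.startswith("./"):                # step A
            path = path[2:]
        elif path.startswith("/./"):               # step B
            path = path[2:]
        elif path == "/.":                         # step B
            path = "/"
        elif path.startswith("/../"):              # step C
            path = path[3:]
            if output:
                output.pop()
        elif path == "/..":                        # step C
            path = "/"
            if output:
                output.pop()
        elif path in (".", ".."):                  # step D
            path = ""
        else:                                      # step E: move one segment
            start = 1 if path.startswith("/") else 0
            end = path.find("/", start)
            if end == -1:
                end = len(path)
            output.append(path[:end])
            path = path[end:]
    return "".join(output)


print(remove_dot_segments("/a/./b/../b/%63/%7bfoo%7d/ros%C3%A9"))
```

Applied to the path of the second IRI in the Syntax-Based example, this yields "/a/b/%63/%7bfoo%7d/ros%C3%A9"; the remaining differences are percent-encoding matters.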
4.3. Scheme-Based Normalization 5.3. Scheme-Based Comparison
The syntax and semantics of IRIs vary from scheme to scheme, as The syntax and semantics of IRIs vary from scheme to scheme, as
described by the defining specification for each scheme. described by the defining specification for each scheme.
Implementations may use scheme-specific rules, at further processing Implementations may use scheme-specific rules, at further processing
cost, to reduce the probability of false negatives. For example, cost, to reduce the probability of false negatives. For example,
because the "http" scheme makes use of an authority component, has a because the "http" scheme makes use of an authority component, has a
default port of "80", and defines an empty path to be equivalent to default port of "80", and defines an empty path to be equivalent to
"/", the following four IRIs are equivalent: "/", the following four IRIs are equivalent:
http://example.com http://example.com
skipping to change at page 12, line 9 skipping to change at page 11, line 45
be resolved. For legibility purposes, they SHOULD NOT be converted be resolved. For legibility purposes, they SHOULD NOT be converted
into ASCII Compatible Encoding (ACE). into ASCII Compatible Encoding (ACE).
Scheme-based normalization may also consider IDN components and their Scheme-based normalization may also consider IDN components and their
conversions to punycode as equivalent. As an example, conversions to punycode as equivalent. As an example,
"http://résumé.example.org" may be considered equivalent to "http://résumé.example.org" may be considered equivalent to
"http://xn--rsum-bpad.example.org". "http://xn--rsum-bpad.example.org".
Other scheme-specific normalizations are possible. Other scheme-specific normalizations are possible.
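
The "http" rules mentioned above (default port "80", empty path equivalent to "/", and IDN host names compared in their ACE form) could be sketched in Python roughly as follows. The names normalize_http_iri and split_host_port are hypothetical helpers introduced here for illustration. The sketch assumes the authority carries no userinfo, relies on Python's built-in "idna" codec, which implements [RFC3490], and is meant for local comparison only; it does not suggest that IRIs should be stored or displayed in ACE form.

```python
from urllib.parse import urlsplit, urlunsplit


def split_host_port(netloc: str) -> tuple[str, str]:
    """Split "host:port" into (host, port); port is "" when absent or empty.

    Hypothetical helper: assumes no userinfo in the authority.
    """
    host, colon, port = netloc.rpartition(":")
    if colon and (port == "" or port.isdigit()):
        return host, port
    return netloc, ""  # no port delimiter (or a bare IPv6 literal)


def normalize_http_iri(iri: str) -> str:
    """Apply a few "http"-specific rules so equivalent IRIs compare equal."""
    parts = urlsplit(iri)
    if parts.scheme != "http":
        return iri  # other schemes need their own rules

    host, port = split_host_port(parts.netloc)
    if port == "80":
        port = ""  # the default port is equivalent to no port

    # Compare IDN host names in ACE (Punycode) form, per RFC 3490.
    host = host.encode("idna").decode("ascii")

    netloc = host + (":" + port if port else "")
    path = parts.path or "/"  # an empty path is equivalent to "/"
    return urlunsplit(("http", netloc, path, parts.query, parts.fragment))


def http_iris_equivalent(a: str, b: str) -> bool:
    return normalize_http_iri(a) == normalize_http_iri(b)


print(http_iris_equivalent("http://example.com", "http://example.com:80/"))
print(http_iris_equivalent("http://résumé.example.org",
                           "http://xn--rsum-bpad.example.org"))
```

The IRI "http://example.com:80/" in the usage lines is an illustrative input chosen to exercise the default-port rule. A fuller implementation would also apply the case and percent-encoding normalizations described earlier in the document before comparing.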
4.4. Protocol-Based Normalization 5.4. Protocol-Based Comparison
Substantial effort to reduce the incidence of false negatives is Substantial effort to reduce the incidence of false negatives is
often cost-effective for web spiders. Consequently, they implement often cost-effective for web spiders. Consequently, they implement
even more aggressive techniques in IRI comparison. For example, if even more aggressive techniques in IRI comparison. For example, if
they observe that an IRI such as they observe that an IRI such as
http://example.com/data http://example.com/data
redirects to an IRI differing only in the trailing slash redirects to an IRI differing only in the trailing slash
http://example.com/data/ http://example.com/data/
they will likely regard the two as equivalent in the future. This they will likely regard the two as equivalent in the future. This
kind of technique is only appropriate when equivalence is clearly kind of technique is only appropriate when equivalence is clearly
indicated by both the result of accessing the resources and the indicated by both the result of accessing the resources and the
common conventions of their scheme's dereference algorithm (in this common conventions of their scheme's dereference algorithm (in this
case, use of redirection by HTTP origin servers to avoid problems case, use of redirection by HTTP origin servers to avoid problems
with relative references). with relative references).
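
A spider's bookkeeping for this kind of observed equivalence might look roughly like the sketch below. The class RedirectEquivalence and its methods are hypothetical names, and the sketch assumes the caller has already performed the HTTP request and passes in the source and target IRIs of the redirect it observed. Only redirects that add a trailing slash are recorded, matching the narrow case described above.

```python
class RedirectEquivalence:
    """Remember IRIs observed to redirect to the same IRI plus a trailing "/".

    Hypothetical sketch: no network access is performed here.
    """

    def __init__(self) -> None:
        self._canonical: dict[str, str] = {}

    def observe_redirect(self, source: str, target: str) -> bool:
        """Record the redirect if it differs only in the trailing slash."""
        if target == source + "/":
            self._canonical[source] = target
            return True
        return False  # any other redirect is not treated as equivalence

    def canonical(self, iri: str) -> str:
        return self._canonical.get(iri, iri)

    def equivalent(self, a: str, b: str) -> bool:
        return self.canonical(a) == self.canonical(b)


seen = RedirectEquivalence()
seen.observe_redirect("http://example.com/data", "http://example.com/data/")
print(seen.equivalent("http://example.com/data", "http://example.com/data/"))
```

Keeping the rule this narrow reflects the caution in the text: the equivalence is inferred from observed behaviour, so it should not be generalized to IRIs for which no such redirect was seen.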
5. Security Considerations 6. Security Considerations
The primary security difficulty comes from applications choosing the The primary security difficulty comes from applications choosing the
wrong equivalence relationship, or two different parties disagreeing wrong equivalence relationship, or two different parties disagreeing
on equivalence. This is especially a problem when IRIs are used in on equivalence. This is especially a problem when IRIs are used in
security protocols. security protocols.
Besides the large character repertoire of Unicode, reasons for Besides the large character repertoire of Unicode, reasons for
confusion include different forms of normalization and different confusion include different forms of normalization and different
normalization expectations, use of percent-encoding with various normalization expectations, use of percent-encoding with various
legacy encodings, and bidirectionality issues. See also [UTR36]. legacy encodings, and bidirectionality issues. See also [UTR36].
6. Acknowledgements 7. Acknowledgements
This document was originally derived from [RFC3986] and [RFC3987], This document was originally derived from [RFC3986] and [RFC3987],
based on text contributed by Tim Bray. based on text contributed by Tim Bray.
7. References 8. References
7.1. Normative References 8.1. Normative References
[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
Requirement Levels", BCP 14, RFC 2119, March 1997. Requirement Levels", BCP 14, RFC 2119, March 1997.
[RFC3490] Faltstrom, P., Hoffman, P., and A. Costello, [RFC3490] Faltstrom, P., Hoffman, P., and A. Costello,
"Internationalizing Domain Names in Applications (IDNA)", "Internationalizing Domain Names in Applications (IDNA)",
RFC 3490, March 2003. RFC 3490, March 2003.
[RFC3491] Hoffman, P. and M. Blanchet, "Nameprep: A Stringprep [RFC3491] Hoffman, P. and M. Blanchet, "Nameprep: A Stringprep
Profile for Internationalized Domain Names (IDN)", Profile for Internationalized Domain Names (IDN)",
skipping to change at page 13, line 42 skipping to change at page 13, line 30
[UNIV6] The Unicode Consortium, "The Unicode Standard, Version [UNIV6] The Unicode Consortium, "The Unicode Standard, Version
6.0.0 (Mountain View, CA, The Unicode Consortium, 2011, 6.0.0 (Mountain View, CA, The Unicode Consortium, 2011,
ISBN 978-1-936213-01-6)", October 2010. ISBN 978-1-936213-01-6)", October 2010.
[UTR15] Davis, M. and M. Duerst, "Unicode Normalization Forms", [UTR15] Davis, M. and M. Duerst, "Unicode Normalization Forms",
Unicode Standard Annex #15, March 2008, Unicode Standard Annex #15, March 2008,
<http://www.unicode.org/unicode/reports/tr15/ <http://www.unicode.org/unicode/reports/tr15/
tr15-23.html>. tr15-23.html>.
7.2. Informative References 8.2. Informative References
[HTML4] Raggett, D., Le Hors, A., and I. Jacobs, "HTML 4.01 [HTML4] Raggett, D., Le Hors, A., and I. Jacobs, "HTML 4.01
Specification", World Wide Web Consortium Recommendation, Specification", World Wide Web Consortium Recommendation,
December 1999, December 1999,
<http://www.w3.org/TR/html401/appendix/notes.html#h-B.2>. <http://www.w3.org/TR/html401/appendix/notes.html#h-B.2>.
[RFC2045] Freed, N. and N. Borenstein, "Multipurpose Internet Mail [RFC2045] Freed, N. and N. Borenstein, "Multipurpose Internet Mail
Extensions (MIME) Part One: Format of Internet Message Extensions (MIME) Part One: Format of Internet Message
Bodies", RFC 2045, November 1996. Bodies", RFC 2045, November 1996.
 End of changes. 30 change blocks. 
89 lines changed or deleted 81 lines changed or added
