draft-ietf-iri-3987bis-03.txt   draft-ietf-iri-3987bis-04.txt 
Internationalized Resource M. Duerst Internationalized Resource M. Duerst
Identifiers (iri) Aoyama Gakuin University Identifiers (iri) Aoyama Gakuin University
Internet-Draft M. Suignard Internet-Draft M. Suignard
Obsoletes: 3987 (if approved) Unicode Consortium Obsoletes: 3987 (if approved) Unicode Consortium
Intended status: Standards Track L. Masinter Intended status: Standards Track L. Masinter
Expires: April 28, 2011 Adobe Expires: September 15, 2011 Adobe
October 25, 2010 March 14, 2011
Internationalized Resource Identifiers (IRIs) Internationalized Resource Identifiers (IRIs)
draft-ietf-iri-3987bis-03 draft-ietf-iri-3987bis-04
Abstract Abstract
This document defines the Internationalized Resource Identifier (IRI) This document defines the Internationalized Resource Identifier (IRI)
protocol element, as an extension of the Uniform Resource Identifier protocol element, as an extension of the Uniform Resource Identifier
(URI). An IRI is a sequence of characters from the Universal (URI). An IRI is a sequence of characters from the Universal
Character Set (Unicode/ISO 10646). Grammar and processing rules are Character Set (Unicode/ISO 10646). Grammar and processing rules are
given for IRIs and related syntactic forms. given for IRIs and related syntactic forms.
In addition, this document provides named additional rule sets for In addition, this document provides named additional rule sets for
skipping to change at page 2, line 21 skipping to change at page 2, line 21
and may be updated, replaced, or obsoleted by other documents at any and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress." material or to cite them other than as "work in progress."
The list of current Internet-Drafts can be accessed at The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt. http://www.ietf.org/ietf/1id-abstracts.txt.
The list of Internet-Draft Shadow Directories can be accessed at The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html. http://www.ietf.org/shadow.html.
This Internet-Draft will expire on April 28, 2011. This Internet-Draft will expire on September 15, 2011.
Copyright Notice Copyright Notice
Copyright (c) 2010 IETF Trust and the persons identified as the Copyright (c) 2011 IETF Trust and the persons identified as the
document authors. All rights reserved. document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents Provisions Relating to IETF Documents
(http://trustee.ietf.org/license-info) in effect on the date of (http://trustee.ietf.org/license-info) in effect on the date of
publication of this document. Please review these documents publication of this document. Please review these documents
carefully, as they describe your rights and restrictions with respect carefully, as they describe your rights and restrictions with respect
to this document. Code Components extracted from this document must to this document. Code Components extracted from this document must
include Simplified BSD License text as described in Section 4.e of include Simplified BSD License text as described in Section 4.e of
the Trust Legal Provisions and are provided without warranty as the Trust Legal Provisions and are provided without warranty as
skipping to change at page 3, line 10 skipping to change at page 3, line 10
outside the IETF Standards Process, and derivative works of it may outside the IETF Standards Process, and derivative works of it may
not be created outside the IETF Standards Process, except to format not be created outside the IETF Standards Process, except to format
it for publication as an RFC or to translate it into languages other it for publication as an RFC or to translate it into languages other
than English. than English.
Table of Contents Table of Contents
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 5 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.1. Overview and Motivation . . . . . . . . . . . . . . . . . 5 1.1. Overview and Motivation . . . . . . . . . . . . . . . . . 5
1.2. Applicability . . . . . . . . . . . . . . . . . . . . . . 6 1.2. Applicability . . . . . . . . . . . . . . . . . . . . . . 6
1.3. Definitions . . . . . . . . . . . . . . . . . . . . . . . 6 1.3. Definitions . . . . . . . . . . . . . . . . . . . . . . . 7
1.4. Notation . . . . . . . . . . . . . . . . . . . . . . . . 9 1.4. Notation . . . . . . . . . . . . . . . . . . . . . . . . 9
2. IRI Syntax . . . . . . . . . . . . . . . . . . . . . . . . . . 9 2. IRI Syntax . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.1. Summary of IRI Syntax . . . . . . . . . . . . . . . . . . 10 2.1. Summary of IRI Syntax . . . . . . . . . . . . . . . . . . 10
2.2. ABNF for IRI References and IRIs . . . . . . . . . . . . 10 2.2. ABNF for IRI References and IRIs . . . . . . . . . . . . 10
3. Processing IRIs and related protocol elements . . . . . . . . 13 3. Processing IRIs and related protocol elements . . . . . . . . 13
3.1. Converting to UCS . . . . . . . . . . . . . . . . . . . . 14 3.1. Converting to UCS . . . . . . . . . . . . . . . . . . . . 14
3.2. Parse the IRI into IRI components . . . . . . . . . . . . 14 3.2. Parse the IRI into IRI components . . . . . . . . . . . . 14
3.3. General percent-encoding of IRI components . . . . . . . 15 3.3. General percent-encoding of IRI components . . . . . . . 15
3.4. Mapping ireg-name . . . . . . . . . . . . . . . . . . . . 16 3.4. Mapping ireg-name . . . . . . . . . . . . . . . . . . . . 16
3.5. Mapping query components . . . . . . . . . . . . . . . . 17 3.5. Mapping query components . . . . . . . . . . . . . . . . 17
skipping to change at page 4, line 10 skipping to change at page 4, line 10
8.2. URI/IRI Entry . . . . . . . . . . . . . . . . . . . . . . 41 8.2. URI/IRI Entry . . . . . . . . . . . . . . . . . . . . . . 41
8.3. URI/IRI Transfer between Applications . . . . . . . . . . 42 8.3. URI/IRI Transfer between Applications . . . . . . . . . . 42
8.4. URI/IRI Generation . . . . . . . . . . . . . . . . . . . 42 8.4. URI/IRI Generation . . . . . . . . . . . . . . . . . . . 42
8.5. URI/IRI Selection . . . . . . . . . . . . . . . . . . . . 43 8.5. URI/IRI Selection . . . . . . . . . . . . . . . . . . . . 43
8.6. Display of URIs/IRIs . . . . . . . . . . . . . . . . . . 43 8.6. Display of URIs/IRIs . . . . . . . . . . . . . . . . . . 43
8.7. Interpretation of URIs and IRIs . . . . . . . . . . . . . 44 8.7. Interpretation of URIs and IRIs . . . . . . . . . . . . . 44
8.8. Upgrading Strategy . . . . . . . . . . . . . . . . . . . 44 8.8. Upgrading Strategy . . . . . . . . . . . . . . . . . . . 44
9. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 45 9. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 45
10. Security Considerations . . . . . . . . . . . . . . . . . . . 46 10. Security Considerations . . . . . . . . . . . . . . . . . . . 46
11. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 47 11. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 47
12. Change Log . . . . . . . . . . . . . . . . . . . . . . . . . . 48 12. Main Changes Since RFC 3987 . . . . . . . . . . . . . . . . . 48
12.1. Changes from draft-duerst-iri-bis-07 to 12.1. Major restructuring of IRI processing model . . . . . . . 48
draft-ietf-iri-3987bis-00 . . . . . . . . . . . . . . . . 48 12.1.1. OLD WAY . . . . . . . . . . . . . . . . . . . . . . . 48
12.2. Changes from -06 to -07 of draft-duerst-iri-bis . . . . . 48 12.1.2. NEW WAY . . . . . . . . . . . . . . . . . . . . . . . 49
12.2.1. OLD WAY . . . . . . . . . . . . . . . . . . . . . . . 49 12.1.3. Extension of Syntax . . . . . . . . . . . . . . . . . 49
12.2.2. NEW WAY . . . . . . . . . . . . . . . . . . . . . . . 49 12.1.4. More to be added . . . . . . . . . . . . . . . . . . . 49
12.3. Changes from -00 to -01 . . . . . . . . . . . . . . . . . 49 12.2. Change Log . . . . . . . . . . . . . . . . . . . . . . . 49
12.4. Changes from -05 to -06 of draft-duerst-iri-bis-00 . . . 49 12.2.1. Changes after draft-ietf-iri-3987bis-01 . . . . . . . 49
12.2.2. Changes from draft-duerst-iri-bis-07 to
draft-ietf-iri-3987bis-00 . . . . . . . . . . . . . . 49
12.2.3. Changes from -06 to -07 of draft-duerst-iri-bis . . . 49
12.3. Changes from -00 to -01 . . . . . . . . . . . . . . . . . 50
12.4. Changes from -05 to -06 of draft-duerst-iri-bis-00 . . . 50
12.5. Changes from -04 to -05 of draft-duerst-iri-bis . . . . . 50 12.5. Changes from -04 to -05 of draft-duerst-iri-bis . . . . . 50
12.6. Changes from -03 to -04 of draft-duerst-iri-bis . . . . . 50 12.6. Changes from -03 to -04 of draft-duerst-iri-bis . . . . . 50
12.7. Changes from -02 to -03 of draft-duerst-iri-bis . . . . . 50 12.7. Changes from -02 to -03 of draft-duerst-iri-bis . . . . . 50
12.8. Changes from -01 to -02 of draft-duerst-iri-bis . . . . . 50 12.8. Changes from -01 to -02 of draft-duerst-iri-bis . . . . . 50
12.9. Changes from -00 to -01 of draft-duerst-iri-bis . . . . . 50 12.9. Changes from -00 to -01 of draft-duerst-iri-bis . . . . . 51
12.10. Changes from RFC 3987 to -00 of draft-duerst-iri-bis . . 51 12.10. Changes from RFC 3987 to -00 of draft-duerst-iri-bis . . 51
13. References . . . . . . . . . . . . . . . . . . . . . . . . . . 51 13. References . . . . . . . . . . . . . . . . . . . . . . . . . . 51
13.1. Normative References . . . . . . . . . . . . . . . . . . 51 13.1. Normative References . . . . . . . . . . . . . . . . . . 51
13.2. Informative References . . . . . . . . . . . . . . . . . 52 13.2. Informative References . . . . . . . . . . . . . . . . . 52
Appendix A. Design Alternatives . . . . . . . . . . . . . . . . . 54 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 55
A.1. New Scheme(s) . . . . . . . . . . . . . . . . . . . . . . 54
A.2. Character Encodings Other Than UTF-8 . . . . . . . . . . 55
A.3. New Encoding Convention . . . . . . . . . . . . . . . . . 55
A.4. Indicating Character Encodings in the URI/IRI . . . . . . 55
Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 56
1. Introduction 1. Introduction
1.1. Overview and Motivation 1.1. Overview and Motivation
A Uniform Resource Identifier (URI) is defined in [RFC3986] as a A Uniform Resource Identifier (URI) is defined in [RFC3986] as a
sequence of characters chosen from a limited subset of the repertoire sequence of characters chosen from a limited subset of the repertoire
of US-ASCII [ASCII] characters. of US-ASCII [ASCII] characters.
The characters in URIs are frequently used for representing words of The characters in URIs are frequently used for representing words of
skipping to change at page 6, line 5 skipping to change at page 6, line 5
as URI references. The syntax of IRIs is defined in Section 2. as URI references. The syntax of IRIs is defined in Section 2.
Using characters outside of A - Z in IRIs adds a number of Using characters outside of A - Z in IRIs adds a number of
difficulties. Section 4 discusses the special case of bidirectional difficulties. Section 4 discusses the special case of bidirectional
IRIs using characters from scripts written right-to-left. Section 5 IRIs using characters from scripts written right-to-left. Section 5
discusses various forms of equivalence between IRIs. Section 6 discusses various forms of equivalence between IRIs. Section 6
discusses the use of IRIs in different situations. Section 8 gives discusses the use of IRIs in different situations. Section 8 gives
additional informative guidelines. Section 10 discusses IRI-specific additional informative guidelines. Section 10 discusses IRI-specific
security considerations. security considerations.
When originally defining IRIs, several design alternatives were
considered. Historically interested readers can find an overview in
Appendix A of [RFC3987]. For some additional background on the
design of URIs and IRIs, please also see [Gettys].
1.2. Applicability 1.2. Applicability
IRIs are designed to allow protocols and software that deal with URIs IRIs are designed to allow protocols and software that deal with URIs
to be updated to handle IRIs. A "URI scheme" (as defined by to be updated to handle IRIs. A "URI scheme" (as defined by
[RFC3986] and registered through the IANA process defined in [RFC3986] and registered through the IANA process defined in
[RFC4395bis]) also serves as an "IRI scheme". Processing of IRIs is [RFC4395bis]) also serves as an "IRI scheme". Processing of IRIs is
accomplished by extending the URI syntax while retaining (and not accomplished by extending the URI syntax while retaining (and not
expanding) the set of "reserved" characters, such that the syntax for expanding) the set of "reserved" characters, such that the syntax for
any URI scheme may be uniformly extended to allow non-ASCII any URI scheme may be uniformly extended to allow non-ASCII
characters. In addition, following parsing of an IRI, it is possible characters. In addition, following parsing of an IRI, it is possible
skipping to change at page 7, line 28 skipping to change at page 7, line 33
character encoding: A method of representing a sequence of character encoding: A method of representing a sequence of
characters as a sequence of octets (maybe with variants). Also, a characters as a sequence of octets (maybe with variants). Also, a
method of (unambiguously) converting a sequence of octets into a method of (unambiguously) converting a sequence of octets into a
sequence of characters. sequence of characters.
charset: The name of a parameter or attribute used to identify a charset: The name of a parameter or attribute used to identify a
character encoding. character encoding.
UCS: Universal Character Set. The coded character set defined by UCS: Universal Character Set. The coded character set defined by
ISO/IEC 10646 [ISO10646] and the Unicode Standard [UNIV4]. ISO/IEC 10646 [ISO10646] and the Unicode Standard [UNIV6].
IRI reference: Denotes the common usage of an Internationalized IRI reference: Denotes the common usage of an Internationalized
Resource Identifier. An IRI reference may be absolute or Resource Identifier. An IRI reference may be absolute or
relative. However, the "IRI" that results from such a reference relative. However, the "IRI" that results from such a reference
only includes absolute IRIs; any relative IRI references are only includes absolute IRIs; any relative IRI references are
resolved to their absolute form. Note that in [RFC2396] URIs did resolved to their absolute form. Note that in [RFC2396] URIs did
not include fragment identifiers, but in [RFC3986] fragment not include fragment identifiers, but in [RFC3986] fragment
identifiers are part of URIs. identifiers are part of URIs.
URL: The term "URL" was originally used [RFC1738] for roughly what URL: The term "URL" was originally used [RFC1738] for roughly what
skipping to change at page 21, line 21 skipping to change at page 21, line 21
4.1. Logical Storage and Visual Presentation 4.1. Logical Storage and Visual Presentation
When stored or transmitted in digital representation, bidirectional When stored or transmitted in digital representation, bidirectional
IRIs MUST be in full logical order and MUST conform to the IRI syntax IRIs MUST be in full logical order and MUST conform to the IRI syntax
rules (which includes the rules relevant to their scheme). This rules (which includes the rules relevant to their scheme). This
ensures that bidirectional IRIs can be processed in the same way as ensures that bidirectional IRIs can be processed in the same way as
other IRIs. other IRIs.
Bidirectional IRIs MUST be rendered by using the Unicode Bidirectional IRIs MUST be rendered by using the Unicode
Bidirectional Algorithm [UNIV4], [UNI9]. Bidirectional IRIs MUST be Bidirectional Algorithm [UNIV6], [UNI9]. Bidirectional IRIs MUST be
rendered in the same way as they would be if they were in a left-to- rendered in the same way as they would be if they were in a left-to-
right embedding; i.e., as if they were preceded by U+202A, LEFT-TO- right embedding; i.e., as if they were preceded by U+202A, LEFT-TO-
RIGHT EMBEDDING (LRE), and followed by U+202C, POP DIRECTIONAL RIGHT EMBEDDING (LRE), and followed by U+202C, POP DIRECTIONAL
FORMATTING (PDF). Setting the embedding direction can also be done FORMATTING (PDF). Setting the embedding direction can also be done
in a higher-level protocol (e.g., the dir='ltr' attribute in HTML). in a higher-level protocol (e.g., the dir='ltr' attribute in HTML).
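As a rough illustration of the left-to-right embedding described above, the following Python sketch wraps an IRI in LRE and PDF for display only. The function name `wrap_for_display` is hypothetical and not part of either draft; the stored or transmitted IRI would remain unwrapped, in logical order.

```python
import unicodedata

LRE = "\u202A"  # LEFT-TO-RIGHT EMBEDDING
PDF = "\u202C"  # POP DIRECTIONAL FORMATTING


def wrap_for_display(iri: str) -> str:
    """Return a display-only copy of the IRI inside a left-to-right embedding.

    Hypothetical helper: the result is meant for rendering, never for
    storage or transmission.
    """
    return f"{LRE}{iri}{PDF}"


if __name__ == "__main__":
    shown = wrap_for_display("http://example.org/\u05d0\u05d1\u05d2")
    print(unicodedata.name(shown[0]))
    print(unicodedata.name(shown[-1]))
```

Where a higher-level protocol already sets the direction (for example dir='ltr' in HTML), a wrapper of this kind would not be needed.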
There is no requirement to use the above embedding if the display is There is no requirement to use the above embedding if the display is
still the same without the embedding. For example, a bidirectional still the same without the embedding. For example, a bidirectional
IRI in a text with left-to-right base directionality (such as used IRI in a text with left-to-right base directionality (such as used
for English or Cyrillic) that is preceded and followed by whitespace for English or Cyrillic) that is preceded and followed by whitespace
skipping to change at page 29, line 30 skipping to change at page 29, line 30
Creating schemes that allow case-insensitive syntax components Creating schemes that allow case-insensitive syntax components
containing non-ASCII characters should be avoided. Case containing non-ASCII characters should be avoided. Case
normalization of non-ASCII characters can be culturally dependent and normalization of non-ASCII characters can be culturally dependent and
is always a complex operation. The only exception concerns non-ASCII is always a complex operation. The only exception concerns non-ASCII
host names for which the character normalization includes a mapping host names for which the character normalization includes a mapping
step derived from case folding. step derived from case folding.
5.3.2.2. Character Normalization 5.3.2.2. Character Normalization
The Unicode Standard [UNIV4] defines various equivalences between The Unicode Standard [UNIV6] defines various equivalences between
sequences of characters for various purposes. Unicode Standard Annex sequences of characters for various purposes. Unicode Standard Annex
#15 [UTR15] defines various Normalization Forms for these #15 [UTR15] defines various Normalization Forms for these
equivalences, in particular Normalization Form C (NFC, Canonical equivalences, in particular Normalization Form C (NFC, Canonical
Decomposition, followed by Canonical Composition) and Normalization Decomposition, followed by Canonical Composition) and Normalization
Form KC (NFKC, Compatibility Decomposition, followed by Canonical Form KC (NFKC, Compatibility Decomposition, followed by Canonical
Composition). Composition).
IRIs already in Unicode MUST NOT be normalized before parsing or IRIs already in Unicode MUST NOT be normalized before parsing or
interpreting. In many non-Unicode character encodings, some text interpreting. In many non-Unicode character encodings, some text
cannot be represented directly. For example, the word "Vietnam" is cannot be represented directly. For example, the word "Vietnam" is
skipping to change at page 30, line 17 skipping to change at page 30, line 17
avoid even more problems; for example, by choosing half-width Latin avoid even more problems; for example, by choosing half-width Latin
letters instead of full-width ones, and full-width instead of half- letters instead of full-width ones, and full-width instead of half-
width Katakana. width Katakana.
As an example, "http://www.example.org/r&#xE9;sum&#xE9;.html" (in XML As an example, "http://www.example.org/r&#xE9;sum&#xE9;.html" (in XML
Notation) is in NFC. On the other hand, Notation) is in NFC. On the other hand,
"http://www.example.org/re&#x301;sume&#x301;.html" is not in NFC. "http://www.example.org/re&#x301;sume&#x301;.html" is not in NFC.
The former uses precombined e-acute characters, and the latter uses The former uses precombined e-acute characters, and the latter uses
"e" characters followed by combining acute accents. Both usages are "e" characters followed by combining acute accents. Both usages are
defined as canonically equivalent in [UNIV4]. defined as canonically equivalent in [UNIV6].
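The difference between the two forms can be checked with a short Python sketch using the standard `unicodedata` module. The helper name `is_nfc` is an assumption made for illustration. The sketch only tests whether a string is in NFC; it does not suggest normalizing an IRI that is already in Unicode, which the text above prohibits before parsing or interpreting.

```python
import unicodedata

# Precomposed form: U+00E9 (e with acute) appears twice.
precomposed_iri = "http://www.example.org/r\u00e9sum\u00e9.html"
# Decomposed form: "e" followed by U+0301 (combining acute accent).
decomposed_iri = "http://www.example.org/re\u0301sume\u0301.html"


def is_nfc(text: str) -> bool:
    """Return True if the text is unchanged by NFC normalization."""
    return unicodedata.normalize("NFC", text) == text


if __name__ == "__main__":
    print(is_nfc(precomposed_iri))   # True
    print(is_nfc(decomposed_iri))    # False
    # The two spellings are canonically equivalent.
    print(unicodedata.normalize("NFC", decomposed_iri) == precomposed_iri)  # True
```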
Note: Because it is unknown how a particular sequence of characters Note: Because it is unknown how a particular sequence of characters
is being treated with respect to character normalization, it would is being treated with respect to character normalization, it would
be inappropriate to allow third parties to normalize an IRI be inappropriate to allow third parties to normalize an IRI
arbitrarily. This does not contradict the recommendation that arbitrarily. This does not contradict the recommendation that
when a resource is created, its IRI should be as character when a resource is created, its IRI should be as character
normalized as possible (i.e., NFC or even NFKC). This is similar normalized as possible (i.e., NFC or even NFKC). This is similar
to the uppercase/lowercase problems. Some parts of a URI are case to the uppercase/lowercase problems. Some parts of a URI are case
insensitive (for example, the domain name). For others, it is insensitive (for example, the domain name). For others, it is
unclear whether they are case sensitive, case insensitive, or unclear whether they are case sensitive, case insensitive, or
skipping to change at page 36, line 40 skipping to change at page 36, line 40
known as Legacy Extended IRI or LEIRI [LEIRI], and Web Address known as Legacy Extended IRI or LEIRI [LEIRI], and Web Address
[HTML5]). [HTML5]).
Future technical specifications SHOULD NOT allow conforming producers Future technical specifications SHOULD NOT allow conforming producers
to produce, or conforming content to contain, such forms, as they are to produce, or conforming content to contain, such forms, as they are
not interoperable with other IRI consuming software. not interoperable with other IRI consuming software.
7.1. LEIRI processing 7.1. LEIRI processing
This section defines Legacy Extended IRIs (LEIRIs). The syntax of This section defines Legacy Extended IRIs (LEIRIs). The syntax of
Legacy Extended IRIs is the same as that for IRIs, except that the Legacy Extended IRIs is the same as that for <IRI-reference>, except
ucschar production is replaced by the leiri-ucschar production: that the ucschar production is replaced by the leiri-ucschar
production:
leiri-ucschar = " " / "<" / ">" / '"' / "{" / "}" / "|" leiri-ucschar = " " / "<" / ">" / '"' / "{" / "}" / "|"
/ "\" / "^" / "`" / %x0-1F / %x7F-D7FF / "\" / "^" / "`" / %x0-1F / %x7F-D7FF
/ %xE000-FFFD / %x10000-10FFFF / %xE000-FFFD / %x10000-10FFFF
Among other extensions, processors based on this specification also Among other extensions, processors based on this specification also
did not enforce the restriction on bidirectional formatting did not enforce the restriction on bidirectional formatting
characters in Section 4.1, and the iprivate production becomes characters in Section 4.1, and the iprivate production becomes
redundant. redundant.
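The leiri-ucschar production above can be transcribed into a small membership test. The following Python sketch is one possible reading, with hypothetical names (`LEIRI_LITERALS`, `LEIRI_RANGES`, `is_leiri_ucschar`); it covers only this one production, not the full LEIRI syntax.

```python
# Literal characters listed in the leiri-ucschar production.
LEIRI_LITERALS = set(' <>"{}|\\^`')

# Code point ranges listed in the production (inclusive bounds).
LEIRI_RANGES = [
    (0x0000, 0x001F),
    (0x007F, 0xD7FF),
    (0xE000, 0xFFFD),
    (0x10000, 0x10FFFF),
]


def is_leiri_ucschar(ch: str) -> bool:
    """Return True if the single character matches leiri-ucschar."""
    if len(ch) != 1:
        raise ValueError("expected exactly one character")
    if ch in LEIRI_LITERALS:
        return True
    code_point = ord(ch)
    return any(low <= code_point <= high for low, high in LEIRI_RANGES)


if __name__ == "__main__":
    for sample in [" ", "a", "\u00e9", "\ufffe"]:
        print(repr(sample), is_leiri_ucschar(sample))
```

Plain ASCII letters such as "a" are rejected here because they are matched by other productions of the IRI grammar, not by leiri-ucschar.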
skipping to change at page 47, line 37 skipping to change at page 47, line 37
Section 4.2 are not followed. The same visual representation may be Section 4.2 are not followed. The same visual representation may be
interpreted as different logical representations, and vice versa. It interpreted as different logical representations, and vice versa. It
is also very important that a correct Unicode bidirectional is also very important that a correct Unicode bidirectional
implementation be used. implementation be used.
The use of Legacy Extended IRIs introduces additional security The use of Legacy Extended IRIs introduces additional security
issues. issues.
11. Acknowledgements 11. Acknowledgements
For contributions to this update, we would like to thank Ian Hickson, This document was derived from [RFC3987]; the acknowledgments from
Michael Sperberg-McQueen, Dan Connolly, Norman Walsh, Richard Tobin, that specification still apply.
Henry S. Thompson, and the XML Core Working Group of the W3C.
The discussion on the issue addressed here started a long time ago.
There was a thread in the HTML working group in August 1995 (under
the topic of "Globalizing URIs") and in the www-international mailing
list in July 1996 (under the topic of "Internationalization and
URLs"), and there were ad-hoc meetings at the Unicode conferences in
September 1995 and September 1997.
For contributions to the previous version of this document, RFC 3987,
many thanks go to Francois Yergeau, Matitiahu Allouche, Roy Fielding,
Tim Berners-Lee, Mark Davis, M.T. Carrasco Benitez, James Clark, Tim
Bray, Chris Wendt, Yaron Goland, Andrea Vine, Misha Wolf, Leslie
Daigle, Ted Hardie, Bill Fenner, Margaret Wasserman, Russ Housley,
Makoto MURATA, Steven Atkin, Ryan Stansifer, Tex Texin, Graham Klyne,
Bjoern Hoehrmann, Chris Lilley, Ian Jacobs, Adam Costello, Dan
Oscarson, Elliotte Rusty Harold, Mike J. Brown, Roy Badami, Jonathan
Rosenne, Asmus Freytag, Simon Josefsson, Carlos Viegas Damasio, Chris
Haynes, Walter Underwood, and many others.
A definition of HyperText Reference was initially produced by Ian
Hickson, and further edited by Dan Connolly and C. M. Sperberg-
McQueen.
Thanks to the Internationalization Working Group (I18N WG) of the
World Wide Web Consortium (W3C), and the members of the W3C I18N
Working Group and Interest Group for their contributions and their
work on [CharMod]. Thanks also go to the members of many other W3C
Working Groups for adopting IRIs, and to the members of the Montreal
IAB Workshop on Internationalization and Localization for their
review.
12. Change Log We would like to thank Ian Hickson, Michael Sperberg-McQueen, and Dan
Connolly for their work on HyperText References, and Norman Walsh,
Richard Tobin, Henry S. Thompson, John Cowan, Paul Grosso, and the XML
Core Working Group of the W3C for their work on LEIRIs.
Note to RFC Editor: Please completely remove this section before In addition, this document was influenced by contributions from (in
publication. no particular order) Chris Lilley, Bjoern Hoehrmann, Felix Sasaki,
Jeremy Carroll, Frank Ellermann, Michael Everson, Cary Karp,
Matitiahu Allouche, Richard Ishida, Addison Phillips, Jonathan
Rosenne, Najib Tounsi, Debbie Garside, Mark Davis, Sarmad Hussain,
Ted Hardie, Konrad Lanz, Thomas Roessler, Lisa Dusseault, Julian
Reschke, Giovanni Campagna, Anne van Kesteren, Mark Nottingham, Erik
van der Poel, Marcin Hanclik, Marcos Caceres, Roy Fielding, Greg
Wilkins, Pieter Hintjens, Daniel R. Tobias, Marko Martin, Maciej
Stachowiak, Wil Tan, Yui Naruse, Michael A. Puls II, Dave Thaler,
Tom Perch, John Klensin, Shawn Steele, Peter Saint-Andre, Geoffrey
Sneddon, Chris Weber, Alex Melnikov, Slim Amamou, SM, Tim Berners-
Lee, Yaron Goland, Sam Ruby, Adam Barth, Abdulrahman I. ALGhadir,
Aharon Lanin, Thomas Milo, Murray Sargent, Marc Blanchet, and Mykyta
Yevstifeyev.
12.1. Changes from draft-duerst-iri-bis-07 to draft-ietf-iri-3987bis-00 12. Main Changes Since RFC 3987
Changed draft name, date, last paragraph of abstract, and titles in This section describes the main changes since [RFC3987].
change log, and added this section in moving from
draft-duerst-iri-bis-07 (personal submission) to
draft-ietf-iri-3987bis-00 (WG document).
12.2. Changes from -06 to -07 of draft-duerst-iri-bis 12.1. Major restructuring of IRI processing model
Major restructuring of IRI processing model to make scheme-specific Major restructuring of IRI processing model to make scheme-specific
translation necessary to handle IDNA requirements and for consistency translation necessary to handle IDNA requirements and for consistency
with web implementations. with web implementations.
Starting with IRI, you want one of: Starting with IRI, you want one of:
a IRI components (IRI parsed into UTF8 pieces) a IRI components (IRI parsed into UTF8 pieces)
b URI components (URI parsed into ASCII pieces, encoded correctly) b URI components (URI parsed into ASCII pieces, encoded correctly)
c whole URI (for passing on to some other system that wants whole c whole URI (for passing on to some other system that wants whole
URIs) URIs)
12.2.1. OLD WAY 12.1.1. OLD WAY
1. Pct-encoding on the whole thing to a URI. (c1) If you want a 1. Pct-encoding on the whole thing to a URI. (c1) If you want a
(maybe broken) whole URI, you might stop here. (maybe broken) whole URI, you might stop here.
2. Parsing the URI into URI components. (b1) If you want (maybe 2. Parsing the URI into URI components. (b1) If you want (maybe
broken) URI components, stop here. broken) URI components, stop here.
3. Decode the components (undoing the pct-encoding). (a) if you want 3. Decode the components (undoing the pct-encoding). (a) if you want
IRI components, stop here. IRI components, stop here.
4. reencode: Either using a different encoding for some components (for 4. reencode: Either using a different encoding for some components (for
domain names, and query components in web pages, which depends on domain names, and query components in web pages, which depends on
the component, scheme and context), and otherwise using pct- the component, scheme and context), and otherwise using pct-
encoding. (b2) if you want (good) URI components, stop here. encoding. (b2) if you want (good) URI components, stop here.
5. reassemble the reencoded components. (c2) if you want a (*good*) 5. reassemble the reencoded components. (c2) if you want a (*good*)
whole URI stop here. whole URI stop here.
12.2.2. NEW WAY 12.1.2. NEW WAY
1. Parse the IRI into IRI components using the generic syntax. (a) 1. Parse the IRI into IRI components using the generic syntax. (a)
if you want IRI components, stop here. if you want IRI components, stop here.
2. Encode each component, using pct-encoding, IDN encoding, or 2. Encode each component, using pct-encoding, IDN encoding, or
special query part encoding depending on the component, scheme, or special query part encoding depending on the component, scheme, or
context. (b) If you want URI components, stop here. context. (b) If you want URI components, stop here.
3. reassemble a whole URI from URI components. (c) if you want a 3. reassemble a whole URI from URI components. (c) if you want a
whole URI stop here. whole URI stop here.
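The three steps of the new processing model might be sketched in Python as follows. This is a minimal illustration under several assumptions, not the procedure the draft specifies: the function names are hypothetical, the query is assumed to use UTF-8 (the draft lets this depend on scheme and context), userinfo and IPv6 literals are ignored, and the host is encoded with Python's built-in "idna" codec, which implements IDNA 2003 and not [RFC5891].

```python
from urllib.parse import quote, urlsplit, urlunsplit

# Characters left untouched in each component. "%" is kept so that
# existing percent-escapes are not encoded a second time.
SUB_DELIMS = "!$&'()*+,;="
PATH_SAFE = "/:@%" + SUB_DELIMS
QUERY_SAFE = "/?:@%" + SUB_DELIMS


def parse_iri(iri):
    """Step 1: split the IRI into IRI components, result (a)."""
    return urlsplit(iri)


def encode_components(parts):
    """Step 2: encode each component on its own, result (b)."""
    netloc = ""
    if parts.hostname:
        # IDN encoding for the host (assumption: IDNA 2003 codec).
        netloc = parts.hostname.encode("idna").decode("ascii")
    if parts.port is not None:
        netloc += f":{parts.port}"
    # Percent-encode non-ASCII characters as UTF-8 octets.
    path = quote(parts.path, safe=PATH_SAFE)
    query = quote(parts.query, safe=QUERY_SAFE)  # assumes UTF-8 query
    fragment = quote(parts.fragment, safe=QUERY_SAFE)
    return (parts.scheme, netloc, path, query, fragment)


def iri_to_uri(iri):
    """Step 3: reassemble a whole URI from URI components, result (c)."""
    return urlunsplit(encode_components(parse_iri(iri)))


if __name__ == "__main__":
    print(iri_to_uri("http://www.example.org/r\u00e9sum\u00e9.html"))
    # http://www.example.org/r%C3%A9sum%C3%A9.html
```

A caller that only needs IRI components can stop after `parse_iri`, and one that needs URI components can stop after `encode_components`, mirroring stopping points (a) and (b) above.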
12.1.3. Extension of Syntax
Added the tag range (U+E0000-E0FFF) to the iprivate production. Some
IRIs generated with the new syntax may fail to pass very strict
checks relying on the old syntax. But characters in this range
should be extremely infrequent anyway.
12.1.4. More to be added
TODO: There are more main changes that need to be documented in this
section.
12.2. Change Log
Note to RFC Editor: Please completely remove this section before
publication.
12.2.1. Changes after draft-ietf-iri-3987bis-01
Changes from draft-ietf-iri-3987bis-01 onwards are available as
changesets in the IETF tools subversion repository at http://
trac.tools.ietf.org/wg/iri/trac/log/draft-ietf-iri-3987bis/
draft-ietf-iri-3987bis.xml.
12.2.2. Changes from draft-duerst-iri-bis-07 to
draft-ietf-iri-3987bis-00
Changed draft name, date, last paragraph of abstract, and titles in
change log, and added this section in moving from
draft-duerst-iri-bis-07 (personal submission) to
draft-ietf-iri-3987bis-00 (WG document).
12.2.3. Changes from -06 to -07 of draft-duerst-iri-bis
Major restructuring of the processing model, see Section 12.1.
12.3. Changes from -00 to -01
o Removed 'mailto:' before mail addresses of authors.
o Added "<to be done>" as right side of 'href-strip' rule. Fixed
'|' to '/' for alternatives.
12.4. Changes from -05 to -06 of draft-duerst-iri-bis-00
o Add HyperText Reference, change abstract, acks and references for
skipping to change at page 52, line 9 skipping to change at page 52, line 23
[RFC5891] Klensin, J., "Internationalized Domain Names in
Applications (IDNA): Protocol", RFC 5891, August 2010.
[STD68] Crocker, D. and P. Overell, "Augmented BNF for Syntax
Specifications: ABNF", STD 68, RFC 5234, January 2008.
[UNI9] Davis, M., "The Bidirectional Algorithm", Unicode Standard
Annex #9, March 2004,
<http://www.unicode.org/reports/tr9/tr9-13.html>.
draft-ietf-iri-3987bis-03:
[UNIV4] The Unicode Consortium, "The Unicode Standard, Version
5.1.0, defined by: The Unicode Standard, Version 5.0
(Boston, MA, Addison-Wesley, 2007. ISBN 0-321-48091-0), as
amended by Unicode 4.1.0
(http://www.unicode.org/versions/Unicode5.1.0/)",
April 2008.
draft-ietf-iri-3987bis-04:
[UNIV6] The Unicode Consortium, "The Unicode Standard, Version
6.0.0 (Mountain View, CA, The Unicode Consortium, 2011,
ISBN 978-1-936213-01-6)", October 2010.
[UTR15] Davis, M. and M. Duerst, "Unicode Normalization Forms",
Unicode Standard Annex #15, March 2008,
<http://www.unicode.org/unicode/reports/tr15/tr15-23.html>.
13.2. Informative References
[BidiEx] "Examples of bidirectional IRIs",
<http://www.w3.org/International/iri-edit/BidiExamples>.
skipping to change at page 53, line 45 skipping to change at page 54, line 8
[RFC2397] Masinter, L., "The "data" URL scheme", RFC 2397,
August 1998.
[RFC2616] Fielding, R., Gettys, J., Mogul, J., Frystyk, H.,
Masinter, L., Leach, P., and T. Berners-Lee, "Hypertext
Transfer Protocol -- HTTP/1.1", RFC 2616, June 1999.
[RFC2640] Curtin, B., "Internationalization of the File Transfer
Protocol", RFC 2640, July 1999.
[RFC3987] Duerst, M. and M. Suignard, "Internationalized Resource
Identifiers (IRIs)", RFC 3987, January 2005.
[RFC4395bis]
Hansen, T., Hardie, T., and L. Masinter, "Guidelines and
Registration Procedures for New URI/IRI Schemes",
draft-hansen-iri-4395bis-irireg-00 (work in progress),
September 2010.
[UNIXML] Duerst, M. and A. Freytag, "Unicode in XML and other
Markup Languages", Unicode Technical Report #20, World
Wide Web Consortium Note, June 2003,
<http://www.w3.org/TR/unicode-xml/>.
skipping to change at page 54, line 39 skipping to change at page 55, line 5
Biron, P. and A. Malhotra, "XML Schema Part 2: Datatypes",
World Wide Web Consortium Recommendation, May 2001,
<http://www.w3.org/TR/xmlschema-2/#anyURI>.
[XPointer]
Grosso, P., Maler, E., Marsh, J., and N. Walsh, "XPointer
Framework", World Wide Web Consortium Recommendation,
March 2003,
<http://www.w3.org/TR/xptr-framework/#escaping>.
Appendix A. Design Alternatives
This section briefly summarizes some design alternatives considered
earlier and the reasons why they were not chosen.
A.1. New Scheme(s)
Introducing new schemes (for example, httpi:, ftpi:,...) or a new
metascheme (e.g., i:, leading to URI/IRI prefixes such as i:http:,
i:ftp:,...) was proposed to make IRI-to-URI conversion scheme
dependent or to distinguish between percent-encodings resulting from
IRI-to-URI conversion and percent-encodings from legacy character
encodings.
New schemes are not needed to distinguish URIs from true IRIs (i.e.,
IRIs that contain non-ASCII characters). The benefit of being able
to detect the origin of percent-encodings is marginal, as UTF-8 can
be detected with very high reliability. Deploying new schemes is
extremely hard, so not requiring new schemes for IRIs makes
deployment of IRIs vastly easier. Making conversion scheme dependent
is highly inadvisable and would be encouraged by separate schemes for
IRIs. Using a uniform convention for conversion from IRIs to URIs
makes IRI implementation orthogonal to the introduction of actual new
schemes.
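The claim that UTF-8 can be detected reliably can be illustrated with a strict decoding check. This is a sketch with a hypothetical function name; it only tests whether the percent-decoded octets form valid UTF-8, and a pure-ASCII component trivially passes.

```python
from urllib.parse import unquote_to_bytes


def percent_encoding_is_utf8(uri_component):
    """Return True if the percent-decoded octets are valid UTF-8.

    Octets produced from a legacy single-byte encoding (for example
    "%E9" for e-acute in iso-8859-1) almost never form valid UTF-8
    sequences, which is what makes this detection reliable in practice.
    """
    raw = unquote_to_bytes(uri_component)
    try:
        raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True
```

For instance, "ros%C3%A9" passes the check, while "ros%E9" (a legacy iso-8859-1 encoding) does not.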
A.2. Character Encodings Other Than UTF-8
At an early stage, UTF-7 was considered as an alternative to UTF-8
when IRIs are converted to URIs. UTF-7 would not have needed
percent-encoding and in most cases would have been shorter than
percent-encoded UTF-8.
Using UTF-8 avoids a double layering and overloading of the use of
the "+" character. UTF-8 is fully compatible with US-ASCII and has
therefore been recommended by the IETF, and is being used widely.
UTF-7 has never been used much and is now clearly being discouraged.
Requiring implementations to convert from UTF-8 to UTF-7 and back
would be an additional implementation burden.
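The length argument and the "+" overloading can both be seen with Python's standard codecs. This is an illustrative sketch with hypothetical helper names, not something the draft specifies.

```python
from urllib.parse import quote


def utf7_form(text):
    """Encode text as UTF-7 and return it as an ASCII string."""
    return text.encode("utf-7").decode("ascii")


def pct_utf8_form(text):
    """Encode text as percent-encoded UTF-8, escaping everything reserved."""
    return quote(text, safe="")


sample = "ros\u00e9"
utf7 = utf7_form(sample)          # uses "+" to introduce a shifted sequence
pct = pct_utf8_form(sample)       # uses "%" and two hex digits per octet

# A literal "+" must itself be escaped in UTF-7, which is the
# overloading of "+" that UTF-8 avoids.
plus_in_utf7 = utf7_form("a+b")
```

Here the UTF-7 form of the sample is shorter than the percent-encoded UTF-8 form, but the literal "+" in "a+b" no longer survives unchanged.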
A.3. New Encoding Convention
Instead of using the existing percent-encoding convention of URIs,
which is based on octets, the idea was to create a new encoding
convention; for example, to use "%u" to introduce UCS code points.
Using the existing octet-based percent-encoding mechanism does not
need an upgrade of the URI syntax and does not need corresponding
server upgrades.
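The compatibility point can be illustrated with an unmodified percent-decoder. In this sketch the "%u00E9" form is a hypothetical rendering of the proposed "%u" convention; Python's standard unquote() stands in for existing deployed software.

```python
from urllib.parse import unquote

# Existing octet-based convention: UTF-8 octets, each percent-encoded.
octet_based = "ros%C3%A9"

# Hypothetical "%u" convention carrying a UCS code point directly.
code_point_based = "ros%u00E9"

# An existing decoder handles the octet-based form as is.
decoded_octets = unquote(octet_based)

# The same decoder does not recognize "%u" and leaves it untouched,
# so software would need an upgrade to interpret it.
decoded_code_point = unquote(code_point_based)
```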
A.4. Indicating Character Encodings in the URI/IRI
Some proposals suggested indicating the character encodings used in
a URI or IRI with some new syntactic convention in the URI itself,
similar to the "charset" parameter for e-mails and Web pages. As an
example, the label in square brackets in
"http://www.example.org/ros[iso-8859-1]&#xE9;" indicated that the
following "&#xE9;" had to be interpreted as iso-8859-1.
If UTF-8 is used exclusively, an upgrade to the URI syntax is not
needed. It avoids potentially multiple labels that have to be copied
correctly in all cases, even on the side of a bus or on a napkin,
leading to usability problems (and being prohibitively annoying).
Exclusively using UTF-8 also reduces transcoding errors and
confusion.
Authors' Addresses
draft-ietf-iri-3987bis-03:
Martin Duerst (Note: Please write "Duerst" with u-umlaut wherever
possible, for example as "D&#252;rst" in XML and HTML)
draft-ietf-iri-3987bis-04:
Martin Duerst
Aoyama Gakuin University
5-10-1 Fuchinobe
Sagamihara, Kanagawa 229-8558
Japan
Phone: +81 42 759 6329
Fax: +81 42 759 6495
Email: duerst@it.aoyama.ac.jp
URI: http://www.sw.it.aoyama.ac.jp/D%C3%BCrst/
(Note: This is the percent-encoded form of an IRI)
Michel Suignard
Unicode Consortium
P.O. Box 391476
Mountain View, CA 94039-1476
U.S.A.
Phone: +1-650-693-3921
Email: michel@unicode.org
URI: http://www.suignard.com
 End of changes. 28 change blocks. 
146 lines changed or deleted 102 lines changed or added
