draft-ietf-idn-uri-01.txt   draft-ietf-idn-uri-02.txt 
INTERNET-DRAFT Martin Duerst
draft-ietf-idn-uri-01 W3C/Keio University
Expires May 2002 November 20, 2001
Internationalized Domain Names in URIs and IRIs Network Working Group M. Duerst
Internet-Draft W3C/Keio University
Expires: December 30, 2002 July 1, 2002
Internationalized Domain Names in URIs
draft-ietf-idn-uri-02
Status of this Memo Status of this Memo
This document is an Internet-Draft and is in full conformance with all This document is an Internet-Draft and is in full conformance with
provisions of Section 10 of RFC2026. all provisions of Section 10 of RFC2026.
Internet-Drafts are working documents of the Internet Engineering Task Internet-Drafts are working documents of the Internet Engineering
Force (IETF), its areas, and its working groups. Note that other Task Force (IETF), its areas, and its working groups. Note that
groups may also distribute working documents as Internet-Drafts. other groups may also distribute working documents as Internet-
Drafts.
Internet-Drafts are draft documents valid for a maximum of six months Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress." material or to cite them other than as "work in progress."
The list of current Internet-Drafts can be accessed at The list of current Internet-Drafts can be accessed at http://
http://www.ietf.org/ietf/1id-abstracts.txt. www.ietf.org/ietf/1id-abstracts.txt.
The list of Internet-Draft Shadow Directories can be accessed at The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html. http://www.ietf.org/shadow.html.
Abstract This Internet-Draft will expire on December 30, 2002.
This document proposes to upgrade the definitions of URIs [RFC 2396] Copyright Notice
and IRIs (Internationalized Resource Identifiers, [IRI]) to work
consistently with internationalized domain names.
0. Change Log Copyright (C) The Internet Society (2002). All Rights Reserved.
0.1 Changes from -00 to -01 Abstract
- Changed requirement for URI/IRI resolvers from MUST to SHOULD This document proposes to upgrade the definition of URIs (RFC 2396)
- Changed IRI syntax slightly (ichar -> idchar, based on changes [RFC2396] to work consistently with internationalized domain names.
in [IRI])
- Various wording changes Table of Contents
1. Introduction
2. URI syntax changes
3. Security considerations
4. Change Log
4.1 Changes from draft-ietf-idn-uri-01 to draft-ietf-idn-uri-02
4.2 Changes from draft-ietf-idn-uri-00 to draft-ietf-idn-uri-01
References
Author's Address
Full Copyright Statement
1. Introduction 1. Introduction
Internet domain names serve to identify hosts and services on the Internet domain names serve to identify hosts and services on the
Internet in a convenient way. The IETF IDN working group is currently Internet in a convenient way. The IETF IDN working group [IDNWG] has
working on extending the character repertoire usable in domain names been working on extending the character repertoire usable in domain
beyond a subset of US-ASCII. names beyond a subset of US-ASCII.
One of the most important places where domain names appear is One of the most important places where domain names appear is
Uniform Resource Identifiers (URIs, [RFC 2396], as modified by Uniform Resource Identifiers (URIs, [RFC 2396], as modified by
[RFC2732]). However, in the current definition of the generic URI [RFC2732]). However, in the current definition of the generic URI
syntax, the restrictions on domain names are 'hard-coded'. In syntax, the restrictions on domain names are 'hard-coded'. In
Section 2, this document relaxes these restrictions by updating Section 2, this document relaxes these restrictions by updating the
the syntax, and defines how internationalized domain names are syntax, and defines how internationalized domain names are encoded in
encoded in URIs. URIs.
URIs are restricted to a subset of US-ASCII. However, IRIs The syntax in this document has been chosen to further increase the
(Internationalized Resource Identifier [IRI]) in general allow uniformity of URI syntax, which is a very important principle of
non-ASCII characters. But the syntax of IRIs has the same 'hard-coded' URIs.
restrictions on domain names as the syntax of URIs. In Section 3,
this document relaxes these restrictions by updating the IRI syntax. In practice, escaped domain names should be used as rarely as
This is done in a way that is compatible with the new syntax for URIs. possible. Wherever possible, the actual characters in
This means that encoding an internationalized domain name in an URI Internationalized Domain Names should be preserved as long as
and encoding the same domain name in an IRI will produce an URI and an possible by using IRIs [IRI] rather than URIs, and only converting to
IRI that can be converted into each other using the procedures defined URIs and then to ACE-encoded [IDNA] domain names (or ideally directly
in [IRI] for these conversions. to ACE-encoding without even using URIs) when resolving the IRI.
Also, this document in no way excludes the use of ACE encoding
directly in an URI domain name part. ACE encoding may be used
directly in an URI domain name part if this is considered necessary
for interoperability.
Please note that even with the definition of URIs in [RFC2396], some
URIs can already contain host names with escaped characters. For
example, mailto:example@w%33.org is legal per [RFC2396] because the
mailto: URI scheme does not follow the generic syntax of [RFC2396].
2. URI syntax changes 2. URI syntax changes
The syntax of URIs [RFC2326] currently contains the following rules The syntax of URIs [RFC2396] currently contains the following rules
relevant to domain names: relevant to domain names:
hostname = *( domainlabel "." ) toplabel [ "." ] hostname = *( domainlabel "." ) toplabel [ "." ]
domainlabel = alphanum | alphanum *( alphanum | "-" ) alphanum domainlabel = alphanum | alphanum *( alphanum | "-" ) alphanum
toplabel = alpha | alpha *( alphanum | "-" ) alphanum toplabel = alpha | alpha *( alphanum | "-" ) alphanum
The latter two rules are changed as follows: The latter two rules are changed as follows:
domainlabel = escalphanum | escalphanum *( escalphanum | "-" ) domainlabel = anchar | anchar *( anchar | "-" ) anchar
escalphanum toplabel = achar | achar *( anchar | "-" ) anchar
toplabel = escalpha | escalpha *( escalphanum | "-" )
escalphanum
and the following rules are added: and the following rules are added:
escalphanum = escaped8 | alphanum anchar = alphanum | escaped
escalpha = escaped8 | alpha achar = alpha | escaped
escaped8 = "%" hexdig8 HEXDIG
hexdig8 = <<HEXDIG greater than 7>>
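The -02 rules (anchar, achar, and the revised domainlabel and toplabel) can be sketched as a regular expression. This is a minimal illustration and not part of either draft: the Python names are hypothetical, and "escaped" is assumed to be the "%" hex hex production of [RFC2396].

```python
import re

# Sketch of the -02 grammar. All names below are illustrative only.
ESCAPED = r"%[0-9A-Fa-f]{2}"                  # escaped = "%" hex hex
ANCHAR = rf"(?:[A-Za-z0-9]|{ESCAPED})"        # anchar = alphanum | escaped
ACHAR = rf"(?:[A-Za-z]|{ESCAPED})"            # achar = alpha | escaped
DOMAINLABEL = rf"{ANCHAR}(?:(?:{ANCHAR}|-)*{ANCHAR})?"
TOPLABEL = rf"{ACHAR}(?:(?:{ANCHAR}|-)*{ANCHAR})?"
# hostname = *( domainlabel "." ) toplabel [ "." ]
HOSTNAME = re.compile(rf"(?:{DOMAINLABEL}\.)*{TOPLABEL}\.?")


def matches_hostname_syntax(host: str) -> bool:
    """Return True if host matches the relaxed hostname syntax."""
    return HOSTNAME.fullmatch(host) is not None
```

A syntactic match alone does not make a name legal: %2a.example.org matches this pattern although the name it stands for is not a legal domain name.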
The %HH escaping is used to encode characters outside the repertoire Characters outside the repertoire (alphanum) are encoded by first
of US-ASCII. This is done by first encoding the characters in UTF-8 encoding the characters in UTF-8 [RFC 2279], resulting in a sequence
[RFC 2279], resulting in a sequence of octets, and then escaping these of octets, and then escaping these octets according to the rules
octets according to the rules defined in [RFC2396]. defined in [RFC2396].
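The two-step encoding described above (UTF-8 first, then %HH escaping of each octet) might look as follows. This is only a sketch: the function name is hypothetical, and it escapes every non-ASCII character while leaving US-ASCII characters untouched.

```python
def escape_hostname(hostname: str) -> str:
    """Escape non-ASCII characters of a host name as %HH over UTF-8 octets."""
    out = []
    for ch in hostname:
        if ord(ch) < 0x80:
            # US-ASCII characters are copied through unescaped.
            out.append(ch)
        else:
            # Encode the character in UTF-8, then escape each octet.
            out.extend(f"%{octet:02X}" for octet in ch.encode("utf-8"))
    return "".join(out)
```

For instance, escape_hostname("dürst.example.org") yields "d%C3%BCrst.example.org", the same escaping of u-umlaut that appears in the author's URI in this document.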
Using UTF-8 assures that this encoding interoperates with IRIs (see Using UTF-8 assures that this encoding interoperates with IRIs [IRI].
Section 3). It is also aligned with the recommendations in [RFC 2277] It is also aligned with the recommendations in [RFC2277] and
and [RFC 2718], and is consistent with the URN syntax [RFC2141] as [RFC2718], and is consistent with the URN syntax [RFC2141] as well as
well as recent URL scheme definitions that define encodings of recent URL scheme definitions that define encodings of non-ASCII
non-ASCII characters based on UTF-8 (e.g., IMAP URLs [RFC 2192] and characters based on UTF-8 (e.g., IMAP URLs [RFC2192] and POP URLs
POP URLs [RFC 2384]). [RFC2384]).
Please note that the use of UTF-8 for encoding internationalized The above syntax rules permit domain names that are neither
domain names in URIs is independent of the choice of encoding chosen permitted as US-ASCII only domain names nor as internationalized
for these names in the DNS protocol. Depending on the choice of domain names. However, such syntax should never be used, and will
encoding for the DNS protocol, an appropriate conversion is necessary. always be rejected by resolvers. For US-ASCII only domain names, the
syntax rules in [RFC2396] are relevant. For example,
http://www.w%33.org is legal, because the corresponding 'w3' is a legal
'domainlabel' according to [RFC2396]. However,
http://%2a.example.org is illegal because the corresponding '*' is not a
legal 'domainlabel' according to [RFC2396]. For domain names
containing non-ASCII characters, the legal domain names are those for
which the ToASCII operation ([IDNA], [Nameprep]; using the unescaped
UTF-8 values as input) is successful.
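The legality check in the -02 text can be sketched per label: unescape, then apply the [RFC2396] rules to US-ASCII labels and a ToASCII attempt to the others. The sketch rests on one loud assumption: Python's built-in "idna" codec, which implements the later RFC 3490 and not the draft cited as [IDNA], stands in for ToASCII. The function and constant names are hypothetical.

```python
import re
from urllib.parse import unquote_to_bytes

# RFC 2396 rules for purely US-ASCII labels.
DOMAINLABEL = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?")
TOPLABEL = re.compile(r"[A-Za-z](?:[A-Za-z0-9-]*[A-Za-z0-9])?")


def is_legal_host(escaped_host: str) -> bool:
    """Check an escaped host name against the rules sketched above."""
    try:
        host = unquote_to_bytes(escaped_host).decode("utf-8")
    except UnicodeDecodeError:
        return False  # the escaped octets are not valid UTF-8
    if host.endswith("."):
        host = host[:-1]  # one optional trailing dot is allowed
    labels = host.split(".")
    for index, label in enumerate(labels):
        if label.isascii():
            is_last = index == len(labels) - 1
            rule = TOPLABEL if is_last else DOMAINLABEL
            if not rule.fullmatch(label):
                return False
        else:
            try:
                # Assumption: the idna codec approximates ToASCII.
                label.encode("idna")
            except UnicodeError:
                return False
    return True
```

With this sketch, www.w%33.org is accepted and %2a.example.org is rejected, matching the two examples in the text.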
The above syntax rules do not extend the possible domain names based For consistency in comparison operations and for interoperability
on US-ASCII characters. This is in accordance with the current direction with older software, the following should be noted: 1) US-ASCII
of the IDN WG [IDNWG]. characters in domain names should not be escaped. 2) Because of the
principle of syntax uniformity for URIs, it is always more prudent to
take into account the possibility that US-ASCII characters are
escaped.
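Software that compares host names may therefore want to undo needless escaping first. The following sketch uses a hypothetical helper name and one design choice that is an assumption, not something the draft specifies: only escapes that stand for US-ASCII letters, digits, or "-" are unescaped, so that illegal names stay visibly illegal.

```python
import re

_ESCAPE = re.compile(r"%([0-9A-Fa-f]{2})")


def unescape_ascii(host: str) -> str:
    """Unescape %HH sequences that encode US-ASCII alphanumerics or '-'."""

    def replace(match: re.Match) -> str:
        ch = chr(int(match.group(1), 16))
        if ch.isascii() and (ch.isalnum() or ch == "-"):
            return ch
        # Keep all other escapes, with uppercase hex digits.
        return match.group(0).upper()

    return _ESCAPE.sub(replace, host)
```

After this step, www.w%33.org and www.w3.org compare equal, while the UTF-8 escapes of non-ASCII characters are left in place.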
The above rules also do not allow escaping of US-ASCII characters, The work of the IDN WG includes some procedures for name preparation
although this is allowed in the other parts of an URI (except for the [Nameprep]. Before encoding an internationalized domain name in an
special provisions in case of reserved characters). Allowing such URI, this preparation step SHOULD be applied. However, the URI
escaping would make the syntax rules quite a bit more complicated, resolver MUST also apply any steps required as part of domain name
would mean that the restrictions on US-ASCII characters can be resolution by [IDNA].
circumvented by using escaping, or would lead to much simpler syntax
rules that don't express these restrictions anymore.
Whether escaping of US-ASCII characters is allowed or not, two things 3. Security considerations
should be noted: 1) It is always better not to escape US-ASCII characters
in domain names because of the possibility that a resolver does not unescape
them. At least purely US-ASCII domain names would then always be resolved
by such a processor. 2) Because of the principle of syntax uniformity for
URIs, it is always more prudent to take into account the possibility that
US-ASCII characters are escaped.
Only the restrictions on US-ASCII characters are expressed in the The security considerations of [RFC2396] and those applying to
rules above. However, all the other restrictions on internationalized internationalized domain names apply. There may be an increased
domain names that are defined by the IDN WG [IDNWG] MUST be respected. potential to smuggle escaped US-ASCII-based domain names across
firewalls, although because of the uniform syntax principle for URIs,
such a potential already exists.
The work of the IDN WG currently includes some procedures for name 4. Change Log
preparation. Before encoding an internationalized domain name in an
URI, this preparation step SHOULD be applied. However, the URI resolver
SHOULD also apply name preparation.
3. IRI syntax changes 4.1 Changes from draft-ietf-idn-uri-01 to draft-ietf-idn-uri-02
The syntax of IRIs [IRI] currently contains the following rules Moved change log to back
relevant to domain names:
hostname = *( domainlabel "." ) toplabel [ "." ] Changed to only change URIs; IRI syntax updated directly in IRI
domainlabel = alphanum | alphanum *( alphanum | "-" ) alphanum draft.
toplabel = alpha | alpha *( alphanum | "-" ) alphanum
The latter two rules are changed as follows: Removed syntax restriction on %hh in the US-ASCII part, but made
clear that restrictions to domain names apply.
domainlabel = intalphanum | intalphanum *( intalphanum | "-" ) Made clear that escaped domain names in URIs should only be an
intalphanum intermediate representation.
toplabel = intalpha | intalpha *( intalphanum | "-" )
intalphanum
and the following rules are added: Gave example of mailto: as already allowing escaped host names.
intalphanum = idchar | alphanum | escaped8 4.2 Changes from draft-ietf-idn-uri-00 to draft-ietf-idn-uri-01
intalpha = idchar | alpha | escaped8
escaped8 = "%" hexdig8 HEXDIG
hexdig8 = <<HEXDIG greater than 7>>
idchar = << any character of the UCS [ISO10646] of U+00A0
and beyond, subject to limitations in Section
3.1. of [IRI] >>
With respect to the allowed domain names based on US-ASCII characters, Changed requirement for URI/IRI resolvers from MUST to SHOULD
the same considerations as in Section 2 apply.
As in Section 2, all the other restrictions on internationalized Changed IRI syntax slightly (ichar -> idchar, based on changes in
domain names that will be defined by the IDN WG MUST be respected. [IRI])
Also, before encoding an internationalized domain name in an IRI,
name preparation SHOULD be applied. However, the IRI resolver SHOULD
also apply name preparation.
It is expected that the rules in Section 3.1 of [IRI] will be less Various wording changes
restrictive than the rules for internationalized domain names, so that
no escaping is necessary. Nevertheless, escaping is allowed for cases
where not all characters can be directly represented.
4. Security Considerations References
The security considerations of [RFC 2396] and [IRI] and those applying [IDNA] Faltstrom, P., Hoffman, P. and A. Costello,
to internationalized domain names apply. There may be an increased "Internationalizing Domain Names in Applications (IDNA)",
potential to smuggle escaped US-ASCII-based domain names across draft-ietf-idn-idna-09.txt (work in progress), May 2002,
firewalls, although because of the uniform syntax principle for <http://www.ietf.org/internet-drafts/draft-ietf-idn-idna-09.txt>.
URIs, such a potential is already existing.
Acknowledgements [IDNWG] "IETF Internationalized Domain Name (idn) Working Group".
Looking forward for comments. Will acknowledge them here! [IRI] Duerst, M. and M. Suignard, "Internationalized Resource
Identifiers (IRI)", draft-duerst-iri-01 (work in
progress), July 2002.
Copyright [ISO10646] International Organization for Standardization,
"Information Technology - Universal Multiple-Octet Coded
Character Set (UCS) - Part 1: Architecture and Basic
Multilingual Plane", ISO Standard 10646-1, October 2000.
Copyright (C) The Internet Society, 1997. All Rights Reserved. [Nameprep] Hoffman, P. and M. Blanchet, "Nameprep: A Stringprep
Profile for Internationalized Domain Names",
draft-ietf-idn-nameprep-10.txt (work in progress), May 2002,
<http://www.ietf.org/internet-drafts/draft-ietf-idn-nameprep-10.txt>.
[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
Requirement Levels", BCP 14, RFC 2119, March 1997.
[RFC2141] Moats, R., "URN Syntax", RFC 2141, May 1997.
[RFC2192] Newman, C., "IMAP URL Scheme", RFC 2192, September 1997.
[RFC2277] Alvestrand, H., "IETF Policy on Character Sets and
Languages", BCP 18, RFC 2277, January 1998.
[RFC2279] Yergeau, F., "UTF-8, a transformation format of ISO
10646", RFC 2279, January 1998.
[RFC2384] Gellens, R., "POP URL Scheme", RFC 2384, August 1998.
[RFC2396] Berners-Lee, T., Fielding, R. and L. Masinter, "Uniform
Resource Identifiers (URI): Generic Syntax", RFC 2396,
August 1998.
[RFC2640] Curtin, B., "Internationalization of the File Transfer
Protocol", RFC 2640, July 1999.
[RFC2718] Masinter, L., Alvestrand, H., Zigmond, D. and R. Petke,
"Guidelines for new URL Schemes", RFC 2718, November
1999.
[RFC2732] Hinden, R., Carpenter, B. and L. Masinter, "Format for
Literal IPv6 Addresses in URL's", RFC 2732, December
1999.
Author's Address
Martin Duerst
W3C/Keio University
5322 Endo
Fujisawa 252-8520
Japan
Phone: +81 466 49 1170
Fax: +81 466 49 1171
EMail: duerst@w3.org
URI: http://www.w3.org/People/D%C3%BCrst/
Full Copyright Statement
Copyright (C) The Internet Society (2002). All Rights Reserved.
This document and translations of it may be copied and furnished to This document and translations of it may be copied and furnished to
others, and derivative works that comment on or otherwise explain it others, and derivative works that comment on or otherwise explain it
or assist in its implementation may be prepared, copied, published or assist in its implementation may be prepared, copied, published
and distributed, in whole or in part, without restriction of any and distributed, in whole or in part, without restriction of any
kind, provided that the above copyright notice and this paragraph kind, provided that the above copyright notice and this paragraph are
are included on all such copies and derivative works. However, this included on all such copies and derivative works. However, this
document itself may not be modified in any way, such as by removing document itself may not be modified in any way, such as by removing
the copyright notice or references to the Internet Society or other the copyright notice or references to the Internet Society or other
Internet organizations, except as needed for the purpose of Internet organizations, except as needed for the purpose of
developing Internet standards in which case the procedures for developing Internet standards in which case the procedures for
copyrights defined in the Internet Standards process must be copyrights defined in the Internet Standards process must be
followed, or as required to translate it into languages other followed, or as required to translate it into languages other than
than English. English.
The limited permissions granted above are perpetual and will not be The limited permissions granted above are perpetual and will not be
revoked by the Internet Society or its successors or assigns. revoked by the Internet Society or its successors or assigns.
This document and the information contained herein is provided on an This document and the information contained herein is provided on an
"AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING
TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING
BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION
HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF
MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE." MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
Author's address
Martin J. Duerst
W3C/Keio University
5322 Endo, Fujisawa
252-8520 Japan
duerst@w3.org
http://www.w3.org/People/D%C3%BCrst/
Tel/Fax: +81 466 49 1170
Note: Please write "Duerst" with u-umlaut wherever
possible, e.g. as "D&#252;rst" in XML and HTML.
References
[IDNWG] IETF Internationalized Domain Name (idn) Working Group.
Information at http://www.ietf.org/html.charters/idn-charter.html.
[IRI] L. Masinter, M. Duerst, "Internationalized Resource Identifiers
(IRI)", Internet Draft, November 2001,
<http://www.ietf.org/internet-drafts/draft-masinter-url-i18n-08.txt>,
work in progress.
[ISO10646] ISO/IEC, Information Technology - Universal Multiple-Octet
Coded Character Set (UCS) - Part 1: Architecture and Basic
Multilingual Plane, Oct. 2000, with amendments.
[RFC 2119] S. Bradner, "Key words for use in RFCs to Indicate
Requirement Levels", March 1997.
[RFC 2141] R. Moats, "URN Syntax", May 1997.
[RFC 2192] C. Newman, "IMAP URL Scheme", September 1997.
[RFC 2277] H. Alvestrand, "IETF Policy on Character Sets and
Languages".
[RFC 2279] F. Yergeau. "UTF-8, a transformation format of ISO 10646.",
January 1998.
[RFC 2384] R. Gellens, "POP URL Scheme", August 1998.
[RFC 2396] T.Berners-Lee, R.Fielding, L.Masinter. "Uniform Resource
Identifiers (URI): Generic Syntax." August 1998.
[RFC 2640] B. Curtin, "Internationalization of the File Transfer
Protocol", July 1999.
[RFC 2718] L. Masinter, H. Alvestrand, D. Zigmond, R. Petke, Acknowledgement
"Guidelines for new URL Schemes", November 1999.
[RFC 2732] R. Hinden, B. Carpenter, L. Masinter, "Format for Literal Funding for the RFC Editor function is currently provided by the
IPv6 Addresses in URL's", December 1999. Internet Society.
 End of changes. 46 change blocks. 
180 lines changed or deleted 188 lines changed or added
