draft-ietf-idn-uri-00.txt   draft-ietf-idn-uri-01.txt 
INTERNET-DRAFT Martin Duerst INTERNET-DRAFT Martin Duerst
draft-ietf-idn-uri-00 W3C/Keio University draft-ietf-idn-uri-01 W3C/Keio University
Expires July 2001 January 6, 2001 Expires May 2002 November 20, 2001
Internationalized Domain Names in URIs and IRIs Internationalized Domain Names in URIs and IRIs
Status of this Memo Status of this Memo
This document is an Internet-Draft and is in full conformance with all This document is an Internet-Draft and is in full conformance with all
provisions of Section 10 of RFC2026. provisions of Section 10 of RFC2026.
Internet-Drafts are working documents of the Internet Engineering Task Internet-Drafts are working documents of the Internet Engineering Task
Force (IETF), its areas, and its working groups. Note that other Force (IETF), its areas, and its working groups. Note that other
skipping to change at line 29 skipping to change at line 29
material or to cite them other than as "work in progress." material or to cite them other than as "work in progress."
The list of current Internet-Drafts can be accessed at The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt. http://www.ietf.org/ietf/1id-abstracts.txt.
The list of Internet-Draft Shadow Directories can be accessed at The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html. http://www.ietf.org/shadow.html.
Abstract Abstract
This document is a first draft for the provisions necessary to This document proposes to upgrade the definitions of URIs [RFC 2396]
upgrade the definitions of URIs [RFC 2396] and IRIs (Internationalized and IRIs (Internationalized Resource Identifiers, [IRI]) to work
Resource Identifiers, [IRI]) to work with internationalized domain consistently with internationalized domain names.
names.
0. Change Log
0.1 Changes from -00 to -01
- Changed requirement for URI/IRI resolvers from MUST to SHOULD
- Changed IRI syntax slightly (ichar -> idchar, based on changes
in [IRI])
- Various wording changes
1. Introduction 1. Introduction
Internet domain names serve to identify hosts and services on the Internet domain names serve to identify hosts and services on the
Internet in a convenient way. The IETF IDN working group is currently Internet in a convenient way. The IETF IDN working group is currently
working on extending the character repertoire usable in domain names working on extending the character repertoire usable in domain names
beyond a subset of US-ASCII. beyond a subset of US-ASCII.
One of the most important places where domain names appear are One of the most important places where domain names appear are
Uniform Resource Identifiers (URIs, [RFC 2396], as modified by Uniform Resource Identifiers (URIs, [RFC 2396], as modified by
[RFC2732]). However, in the current definition of the generic URI [RFC2732]). However, in the current definition of the generic URI
syntax, the restrictions on domain names are 'hard-coded'. This syntax, the restrictions on domain names are 'hard-coded'. In
document proposes to relax these restrictions by updating the syntax, Section 2, this document relaxes these restrictions by updating
and defines how internationalized domain names are encoded in URIs. the syntax, and defines how internationalized domain names are
encoded in URIs.
URIs themselves are restricted to a subset of US-ASCII. However, URIs are restricted to a subset of US-ASCII. However, IRIs
there is a proposal for relieving these restrictions by creating (Internationalized Resource Identifier [IRI]) in general allow
a new protocol element called an IRI (Internationalized Resource non-ASCII characters. But the syntax of IRIs has the same 'hard-coded'
Identifier [IRI]). While IRIs in general allow the use of non-ASCII restrictions on domain names as the syntax of URIs. In Section 3,
characters, the syntax of IRIs has the same restriction for domain this document relaxes these restrictions by updating the IRI syntax.
names as the syntax of URIs. This document proposes to relax these This is done in a way that is compatible with the new syntax for URIs.
restrictions, too, in a way that is compatible with the new syntax This means that encoding an internationalized domain name in an URI
for URIs. This means that encoding an internationalized domain name in and encoding the same domain name in an IRI will produce an URI and an
an URI and encoding the same name in an IRI will produce an URI and an
IRI that can be converted into each other using the procedures defined IRI that can be converted into each other using the procedures defined
in [IRI] for these conversions. in [IRI] for these conversions.
2. URI syntax changes 2. URI syntax changes
The syntax of URIs [RFC 2396] currently contains the following rules The syntax of URIs [RFC 2396] currently contains the following rules
relevant to domain names: relevant to domain names:
hostname = *( domainlabel "." ) toplabel [ "." ] hostname = *( domainlabel "." ) toplabel [ "." ]
domainlabel = alphanum | alphanum *( alphanum | "-" ) alphanum domainlabel = alphanum | alphanum *( alphanum | "-" ) alphanum
skipping to change at line 86 skipping to change at line 94
and the following rules are added: and the following rules are added:
escalphanum = escaped8 | alphanum escalphanum = escaped8 | alphanum
escalpha = escaped8 | alpha escalpha = escaped8 | alpha
escaped8 = "%" hexdig8 HEXDIG escaped8 = "%" hexdig8 HEXDIG
hexdig8 = <<HEXDIG greater than 7>> hexdig8 = <<HEXDIG greater than 7>>
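The escaped8 rule above admits exactly the escapes %80 through %FF, that is, escapes of octets outside US-ASCII. A minimal sketch of a matcher for this rule follows. The function name is_escaped8 is hypothetical, and accepting lowercase hexadecimal digits is an assumption of this sketch rather than something the rules above state.

```python
import re

# escaped8 = "%" hexdig8 HEXDIG, where hexdig8 is a hex digit greater than 7.
# Assumption of this sketch: lowercase hex digits are accepted as well.
ESCAPED8 = re.compile(r"%[89A-Fa-f][0-9A-Fa-f]")


def is_escaped8(token: str) -> bool:
    """Return True if token is exactly one escaped8 sequence (%80 to %FF)."""
    return ESCAPED8.fullmatch(token) is not None


if __name__ == "__main__":
    for sample in ("%C3", "%80", "%7F", "%41"):
        print(sample, is_escaped8(sample))
```

Escapes of US-ASCII octets such as %41 are rejected, which mirrors the fact that the rules above do not allow escaping of US-ASCII characters in domain names.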
The %HH escaping is used to encode characters outside the repertoire The %HH escaping is used to encode characters outside the repertoire
of US-ASCII. This is done by first encoding the characters in UTF-8 of US-ASCII. This is done by first encoding the characters in UTF-8
[RFC 2279], resulting in a sequence of octets, and then escaping these [RFC 2279], resulting in a sequence of octets, and then escaping these
octets. octets according to the rules defined in [RFC2396].
Using UTF-8 assures that this encoding interoperates with IRIs (see Using UTF-8 assures that this encoding interoperates with IRIs (see
Section 3). It is also alligned with the recommendations in [RFC 2277] Section 3). It is also aligned with the recommendations in [RFC 2277]
and [RFC 2718], and is consistent with the URN syntax [RFC2141] as and [RFC 2718], and is consistent with the URN syntax [RFC2141] as
well as recent URL scheme definitions that define encodings of well as recent URL scheme definitions that define encodings of
non-ASCII characters based on (e.g., IMAP URLs [RFC 2192] and POP URLs non-ASCII characters based on UTF-8 (e.g., IMAP URLs [RFC 2192] and
[RFC 2384]). POP URLs [RFC 2384]).
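The two-step encoding described above (UTF-8 first, then %HH escaping of the resulting octets) can be sketched as follows. The function name escape_hostname is hypothetical. The sketch leaves all US-ASCII characters untouched and does not check the other restrictions on domain labels.

```python
def escape_hostname(hostname: str) -> str:
    """Escape the non-ASCII characters of a host name as %HH over UTF-8."""
    out = []
    for ch in hostname:
        if ord(ch) < 0x80:
            # US-ASCII characters are copied through unescaped.
            out.append(ch)
        else:
            # Step 1: encode the character in UTF-8.
            # Step 2: escape each resulting octet as %HH.
            out.extend("%{:02X}".format(octet) for octet in ch.encode("utf-8"))
    return "".join(out)


if __name__ == "__main__":
    print(escape_hostname("d\u00fcrst.example"))  # d%C3%BCrst.example
```

Every octet produced by UTF-8 for a non-ASCII character is 0x80 or greater, so each escape emitted here has the escaped8 form.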
Please note that the use of UTF-8 for encoding internationalized Please note that the use of UTF-8 for encoding internationalized
domain names in URIs is independent of the choice of encoding chosen domain names in URIs is independent of the choice of encoding chosen
for these names in the DNS protocol. In case something else than UTF-8 for these names in the DNS protocol. Depending on the choice of
is chosen for the latter, a future version of this document may give encoding for the DNS protocol, an appropriate conversion is necessary.
instructions for the conversion if deemed necessary.
The above syntax rules do not extend the possible domain names based The above syntax rules do not extend the possible domain names based
on US-ASCII characters. This may have to be changed in case the IDN on US-ASCII characters. This is in accordance with the current direction
WG should decide to allow such extensions. of the IDN WG [IDNWG].
The above rules also do not allow escaping of US-ASCII characters, The above rules also do not allow escaping of US-ASCII characters,
although this is allowed in the other parts of an URI (except for the although this is allowed in the other parts of an URI (except for the
special provisions in case of reserved characters). Allowing such special provisions in case of reserved characters). Allowing such
escaping would make the syntax rules quite a bit more complicated, escaping would make the syntax rules quite a bit more complicated,
would mean that the restrictions on US-ASCII characters can be would mean that the restrictions on US-ASCII characters can be
circumvented by using escaping, or would lead to much simpler syntax circumvented by using escaping, or would lead to much simpler syntax
rules that don't express these restrictions anymore. Even in case rules that don't express these restrictions anymore.
escaping of US-ASCII characters is allowed in order to simplify
processing, it should be noted that it is always better not to escape
US-ASCII characters in domain names because of the possibility that
a resolver cannot unescape them. At least purely US-ASCII domain names
would then always be resolved by such a processor.
While only the restrictions on US-ASCII characters are expressed in the Whether escaping of US-ASCII characters is allowed or not, two things
rules above, all the other restrictions on internationalized should be noted: 1) It is always better not to escape US-ASCII characters
domain names that will be defined by the IDN WG MUST be respected. in domain names because of the possibility that a resolver does not unescape
them. At least purely US-ASCII domain names would then always be resolved
by such a processor. 2) Because of the principle of syntax uniformity for
URIs, it is always more prudent to take into account the possibility that
US-ASCII characters are escaped.
Only the restrictions on US-ASCII characters are expressed in the
rules above. However, all the other restrictions on internationalized
domain names that are defined by the IDN WG [IDNWG] MUST be respected.
The work of the IDN WG currently includes some procedures for name The work of the IDN WG currently includes some procedures for name
preparation. Before encoding an internationalized domain name in an preparation. Before encoding an internationalized domain name in an
URI, this preparation step SHOULD be applied. However, the resolver URI, this preparation step SHOULD be applied. However, the URI resolver
MUST also apply name preparation. SHOULD also apply name preparation.
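As an illustration of applying name preparation before encoding, the sketch below uses the nameprep function from Python's standard encodings.idna module. This is an assumption: that function implements the Nameprep profile as it was later published, which may differ from the procedures under discussion in the IDN WG at the time of this draft. The function name prepare_hostname is hypothetical.

```python
from encodings.idna import nameprep


def prepare_hostname(hostname: str) -> str:
    """Apply name preparation to each label of a host name.

    Assumption: Python's stdlib nameprep stands in for the IDN WG's
    name preparation step; the two are not guaranteed to be identical.
    """
    return ".".join(nameprep(label) for label in hostname.split("."))


if __name__ == "__main__":
    print(prepare_hostname("D\u00dcrst.Example"))
```

The prepared name would then be encoded in UTF-8 and escaped. A resolver that applies the same step again obtains the same result for a name that was already prepared.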
2. IRI syntax changes 3. IRI syntax changes
The syntax of IRIs [IRI] currently contains the following rules The syntax of IRIs [IRI] currently contains the following rules
relevant to domain names: relevant to domain names:
hostname = *( domainlabel "." ) toplabel [ "." ] hostname = *( domainlabel "." ) toplabel [ "." ]
domainlabel = alphanum | alphanum *( alphanum | "-" ) alphanum domainlabel = alphanum | alphanum *( alphanum | "-" ) alphanum
toplabel = alpha | alpha *( alphanum | "-" ) alphanum toplabel = alpha | alpha *( alphanum | "-" ) alphanum
The latter two rules are changed as follows: The latter two rules are changed as follows:
domainlabel = intalphanum | intalphanum *( intalphanum | "-" ) domainlabel = intalphanum | intalphanum *( intalphanum | "-" )
intalphanum intalphanum
toplabel = intalpha | intalpha *( intalphanum | "-" ) toplabel = intalpha | intalpha *( intalphanum | "-" )
intalphanum intalphanum
and the following rules are added: and the following rules are added:
intalphanum = ichar | alphanum | escaped8 intalphanum = idchar | alphanum | escaped8
intalpha = ichar | alpha | escaped8 intalpha = idchar | alpha | escaped8
escaped8 = "%" hexdig8 HEXDIG escaped8 = "%" hexdig8 HEXDIG
hexdig8 = <<HEXDIG greater than 7>> hexdig8 = <<HEXDIG greater than 7>>
idchar = << any character of the UCS [ISO10646] of U+00A0
where ichar, as in [IRI], is: and beyond, subject to limitations in Section
ichar = << any character of UCS [ISO10646] beyond
U+0080, subject to limitations in Section
3.1. of [IRI] >> 3.1. of [IRI] >>
With respect to the allowed domain names based on US-ASCII characters, With respect to the allowed domain names based on US-ASCII characters,
the same considerations as in Section 2 apply. the same considerations as in Section 2 apply.
As in Section 2, all the other restrictions on internationalized As in Section 2, all the other restrictions on internationalized
domain names that will be defined by the IDN WG MUST be respected. domain names that will be defined by the IDN WG MUST be respected.
Also, before encoding an internationalized domain name in an IRI, Also, before encoding an internationalized domain name in an IRI,
name preparation SHOULD be applied. However, the IRI resolver MUST name preparation SHOULD be applied. However, the IRI resolver SHOULD
also apply name preparation. also apply name preparation.
It is expected that the rules in Section 3.1 of [IRI] will be less It is expected that the rules in Section 3.1 of [IRI] will be less
restrictive than the rules for internationalized domain names, so that restrictive than the rules for internationalized domain names, so that
no escaping is necessary. Nevertheless, escaping is allowed for cases no escaping is necessary. Nevertheless, escaping is allowed for cases
where not all characters can be directly represented. where not all characters can be directly represented.
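The compatibility between the URI form and the IRI form of a host name can be illustrated with a round trip. This is a simplified sketch, not the full conversion procedure of [IRI]. The function names are hypothetical, and only escaped8 sequences are unescaped, so US-ASCII escapes are left as they are.

```python
import re

# One or more consecutive escaped8 sequences (%80 to %FF).
_ESCAPED8_RUN = re.compile(r"(?:%[89A-Fa-f][0-9A-Fa-f])+")


def iri_host_to_uri_host(host: str) -> str:
    """Escape every non-ASCII character as %HH over its UTF-8 octets."""
    return "".join(
        ch if ord(ch) < 0x80
        else "".join("%{:02X}".format(b) for b in ch.encode("utf-8"))
        for ch in host
    )


def uri_host_to_iri_host(host: str) -> str:
    """Replace runs of escaped8 sequences by the characters they encode."""
    def decode(match):
        octets = bytes.fromhex(match.group(0).replace("%", ""))
        # Raises UnicodeDecodeError if the octets are not valid UTF-8.
        return octets.decode("utf-8")
    return _ESCAPED8_RUN.sub(decode, host)


if __name__ == "__main__":
    uri_host = iri_host_to_uri_host("d\u00fcrst.example")
    print(uri_host)
    print(uri_host_to_iri_host(uri_host) == "d\u00fcrst.example")
```

Decoding whole runs of escapes at once matters because a single non-ASCII character is represented by several consecutive escaped octets.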
4. Security Considerations 4. Security Considerations
Besides the security considerations of [RFC 2396] and [IRI] and those The security considerations of [RFC 2396] and [IRI] and those applying
applying to the various aspects of internationalized domain names in to internationalized domain names apply. There may be an increased
general, there are currently no known security problems. potential to smuggle escaped US-ASCII-based domain names across
firewalls, although because of the uniform syntax principle for
URIs, such a potential is already existing.
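One way a filtering component might guard against escaped US-ASCII host names is to unescape the host before comparing it with its list of names. The sketch below is a hypothetical illustration only: the function names and the blocklist are assumptions, and a real filter would also have to apply name preparation and handle malformed input.

```python
from urllib.parse import unquote


def canonical_host(host: str) -> str:
    """Unescape %HH sequences and lowercase the result for comparison."""
    return unquote(host).lower()


def is_blocked(host: str, blocklist: set) -> bool:
    """Compare the unescaped host against a set of lowercase host names."""
    return canonical_host(host) in blocklist


if __name__ == "__main__":
    blocked = {"example.com"}
    print(is_blocked("example.com", blocked))
    print(is_blocked("%65xample.com", blocked))  # escaped "e"
```

Comparing only the literal string would let "%65xample.com" pass a filter that blocks "example.com", which is the kind of smuggling mentioned above.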
Acknowledgements Acknowledgements
To be done. Looking forward to comments. Will acknowledge them here!
Copyright Copyright
Copyright (C) The Internet Society, 1997. All Rights Reserved. Copyright (C) The Internet Society, 1997. All Rights Reserved.
This document and translations of it may be copied and furnished to This document and translations of it may be copied and furnished to
others, and derivative works that comment on or otherwise explain it others, and derivative works that comment on or otherwise explain it
or assist in its implementation may be prepared, copied, published or assist in its implementation may be prepared, copied, published
and distributed, in whole or in part, without restriction of any and distributed, in whole or in part, without restriction of any
kind, provided that the above copyright notice and this paragraph kind, provided that the above copyright notice and this paragraph
skipping to change at line 223 skipping to change at line 232
252-8520 Japan 252-8520 Japan
duerst@w3.org duerst@w3.org
http://www.w3.org/People/D%C3%BCrst/ http://www.w3.org/People/D%C3%BCrst/
Tel/Fax: +81 466 49 1170 Tel/Fax: +81 466 49 1170
Note: Please write "Duerst" with u-umlaut wherever Note: Please write "Duerst" with u-umlaut wherever
possible, e.g. as "D&#252;rst" in XML and HTML. possible, e.g. as "D&#252;rst" in XML and HTML.
References References
[IDNWG] IETF Internationalized Domain Name (idn) Working Group.
Information at http://www.ietf.org/html.charters/idn-charter.html.
[IRI] L. Masinter, M. Duerst, "Internationalized Resource Identifiers [IRI] L. Masinter, M. Duerst, "Internationalized Resource Identifiers
(IRI)", Internet Draft, January 2001, (IRI)", Internet Draft, November 2001,
<http://www.ietf.org/internet-drafts/draft-masinter-url-i18n-06.txt>, <http://www.ietf.org/internet-drafts/draft-masinter-url-i18n-08.txt>,
work in progress. work in progress.
[ISO10646] ISO/IEC, Information Technology - Universal Multiple-Octet [ISO10646] ISO/IEC, Information Technology - Universal Multiple-Octet
Coded Character Set (UCS) - Part 1: Architecture and Basic Coded Character Set (UCS) - Part 1: Architecture and Basic
Multilingual Plane, Oct. 2000, with amendments. Multilingual Plane, Oct. 2000, with amendments.
[RFC 2119] S. Bradner, "Key words for use in RFCs to Indicate [RFC 2119] S. Bradner, "Key words for use in RFCs to Indicate
Requirement Levels", March 1997. Requirement Levels", March 1997.
[RFC 2141] R. Moats, "URN Syntax", May 1997. [RFC 2141] R. Moats, "URN Syntax", May 1997.
skipping to change at line 248 skipping to change at line 260
[RFC 2277] H. Alvestrand, "IETF Policy on Character Sets and [RFC 2277] H. Alvestrand, "IETF Policy on Character Sets and
Languages". Languages".
[RFC 2279] F. Yergeau. "UTF-8, a transformation format of ISO 10646.", [RFC 2279] F. Yergeau. "UTF-8, a transformation format of ISO 10646.",
January 1998. January 1998.
[RFC 2384] R. Gellens, "POP URL Scheme", August 1998. [RFC 2384] R. Gellens, "POP URL Scheme", August 1998.
[RFC 2396] T.Berners-Lee, R.Fielding, L.Masinter. "Uniform Resource [RFC 2396] T.Berners-Lee, R.Fielding, L.Masinter. "Uniform Resource
Identifiers (URI): Generic Syntax." August, 1998. Identifiers (URI): Generic Syntax." August 1998.
[RFC 2640] B. Curtis, "Internationalization of the File Transfer [RFC 2640] B. Curtis, "Internationalization of the File Transfer
Protocol", July 1999. Protocol", July 1999.
[RFC 2718] L. Masinter, H. Alvestrand, D. Zigmond, R. Petke, [RFC 2718] L. Masinter, H. Alvestrand, D. Zigmond, R. Petke,
"Guidelines for new URL Schemes", November 1999. "Guidelines for new URL Schemes", November 1999.
[RFC 2732] R. Hinden, B. Carpenter, L. Masinter, "Format for Literal [RFC 2732] R. Hinden, B. Carpenter, L. Masinter, "Format for Literal
IPv6 Addresses in URL's", December 1999. IPv6 Addresses in URL's", December 1999.
 End of changes. 21 change blocks. 
54 lines changed or deleted 66 lines changed or added
