Network Working Group                                          M. Duerst
Internet-Draft                                       W3C/Keio University
Expires: December 30, 2002                                  July 1, 2002

                  Internationalized Domain Names in URIs and IRIs
                          draft-ietf-idn-uri-02

Status of this Memo

    This document is an Internet-Draft and is in full conformance with
    all provisions of Section 10 of RFC2026.

    Internet-Drafts are working documents of the Internet Engineering
    Task Force (IETF), its areas, and its working groups.  Note that
    other groups may also distribute working documents as
    Internet-Drafts.

    Internet-Drafts are draft documents valid for a maximum of six months
    and may be updated, replaced, or obsoleted by other documents at any
    time.  It is inappropriate to use Internet-Drafts as reference
    material or to cite them other than as "work in progress."

    The list of current Internet-Drafts can be accessed at
    http://www.ietf.org/ietf/1id-abstracts.txt.

    The list of Internet-Draft Shadow Directories can be accessed at
    http://www.ietf.org/shadow.html.

    This Internet-Draft will expire on December 30, 2002.

Copyright Notice

    Copyright (C) The Internet Society (2002).  All Rights Reserved.

Abstract

    This document proposes to upgrade the definition of URIs (RFC 2396)
    [RFC2396] to work consistently with internationalized domain names.

Table of Contents

    1.  Introduction
    2.  URI syntax changes
    3.  Security considerations
    4.  Change Log
    4.1 Changes from draft-ietf-idn-uri-01 to draft-ietf-idn-uri-02
    4.2 Changes from draft-ietf-idn-uri-00 to draft-ietf-idn-uri-01
        References
        Author's Address
        Full Copyright Statement

1. Introduction

    Internet domain names serve to identify hosts and services on the
    Internet in a convenient way.  The IETF IDN working group [IDNWG] has
    been working on extending the character repertoire usable in domain
    names beyond a subset of US-ASCII.

    One of the most important places where domain names appear is in
    Uniform Resource Identifiers (URIs, [RFC2396], as modified by
    [RFC2732]).  However, in the current definition of the generic URI
    syntax, the restrictions on domain names are 'hard-coded'.  In
    Section 2, this document relaxes these restrictions by updating the
    syntax, and defines how internationalized domain names are encoded in
    URIs.

    The syntax in this document has been chosen to further increase the
    uniformity of URI syntax, which is a very important principle of
    URIs.

    In practice, escaped domain names should be used as rarely as
    possible.  Wherever possible, the actual characters in
    Internationalized Domain Names should be preserved as long as
    possible by using IRIs [IRI] rather than URIs, and only converting to
    URIs and then to ACE-encoded [IDNA] domain names (or ideally directly
    to ACE-encoding without even using URIs) when resolving the IRI.
    Also, this document in no way excludes the use of ACE encoding
    directly in an URI domain name part.  ACE encoding may be used
    directly in an URI domain name part if this is considered necessary
    for interoperability.

    Please note that even with the definition of URIs in [RFC2396], some
    URIs can already contain host names with escaped characters.  For
    example, mailto:example@w%33.org is legal per [RFC2396] because the
    mailto: URI scheme does not follow the generic syntax of [RFC2396].

2. URI syntax changes

    The syntax of URIs [RFC2396] currently contains the following rules
    relevant to domain names:

           hostname      = *( domainlabel "." ) toplabel [ "." ]
           domainlabel   = alphanum | alphanum *( alphanum | "-" ) alphanum
           toplabel      = alpha | alpha *( alphanum | "-" ) alphanum

    The latter two rules are changed as follows:

           domainlabel   = anchar | anchar *( anchar | "-" ) anchar
           toplabel      = achar | achar *( anchar | "-" ) anchar

    and the following rules are added:

           anchar        = alphanum | escaped
           achar         = alpha | escaped

    Characters outside the repertoire (alphanum) are encoded by first
    encoding the characters in UTF-8 [RFC2279], resulting in a sequence
    of octets, and then escaping these octets according to the rules
    defined in [RFC2396].
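
    The two encoding steps can be sketched in Python as follows.  This
    is only an illustration, not part of the proposal: the helper name
    escape_hostname is hypothetical, the name preparation step is
    omitted, and the standard library function urllib.parse.quote is
    assumed to be an acceptable stand-in for the escaping rules of
    [RFC2396].

```python
from urllib.parse import quote


def escape_hostname(hostname: str) -> str:
    """Percent-escape the UTF-8 octets of every non-ASCII character.

    Hypothetical helper, not defined by this document.  Labels are
    processed one at a time so that the "." separators stay literal.
    """
    labels = hostname.split(".")
    # quote() first encodes each character in UTF-8 and then writes
    # every octet that is not an unreserved ASCII character as %HH.
    # ASCII letters, digits and "-" are left unescaped.
    return ".".join(quote(label, safe="-", encoding="utf-8") for label in labels)


# U+00FC (u with diaeresis) is the octet pair C3 BC in UTF-8.
print(escape_hostname("www.d\u00fcrst.example"))  # www.d%C3%BCrst.example
```

    Names that consist only of US-ASCII letters, digits and hyphens
    pass through this sketch unchanged.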

    Using UTF-8 assures that this encoding interoperates with IRIs [IRI].
    It is also aligned with the recommendations in [RFC2277] and
    [RFC2718], and is consistent with the URN syntax [RFC2141] as well as
    recent URL scheme definitions that define encodings of non-ASCII
    characters based on UTF-8 (e.g., IMAP URLs [RFC2192] and POP URLs
    [RFC2384]).

    The above syntax rules permit domain names that are neither
    permitted as US-ASCII only domain names nor as internationalized
    domain names.  However, such syntax should never be used, and will
    always be rejected by resolvers.  For US-ASCII only domain names, the
    syntax rules in [RFC2396] are relevant.  For example,
    http://www.w%33.org is legal, because the corresponding 'w3' is a
    legal 'domainlabel' according to [RFC2396].  However,
    http://%2a.example.org is illegal because the corresponding '*' is
    not a legal 'domainlabel' according to [RFC2396].  For domain names
    containing non-ASCII characters, the legal domain names are those for
    which the ToASCII operation ([IDNA], [Nameprep]; using the unescaped
    UTF-8 values as input) is successful.

    For consistency in comparison operations and for interoperability
    with older software, the following should be noted: 1) US-ASCII
    characters in domain names should not be escaped.  2) Because of the
    principle of syntax uniformity for URIs, it is always more prudent to
    take into account the possibility that US-ASCII characters are
    escaped.
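
    One way to take escaped US-ASCII characters into account when
    comparing host names is to reduce both names to a common form first.
    The Python sketch below is an assumption-laden illustration, not a
    comparison rule defined by this document: hostname_key is a
    hypothetical name, the input is assumed to be a legal host name, and
    Python's "idna" codec again stands in for ToASCII.

```python
from urllib.parse import unquote


def hostname_key(escaped: str) -> str:
    """Hypothetical comparison key for the host part of a URI."""
    # Undo any %HH escaping, including escaped US-ASCII characters.
    host = unquote(escaped, encoding="utf-8", errors="strict")
    # Assumption: the "idna" codec stands in for ToASCII, which leaves
    # US-ASCII only labels unchanged and converts the others to ACE.
    ascii_host = host.encode("idna").decode("ascii")
    # US-ASCII letters in domain names compare case-insensitively.
    return ascii_host.lower()


print(hostname_key("www.w%33.org") == hostname_key("WWW.W3.org"))  # True
```

    Two host names are then treated as equivalent when their keys are
    equal, whether or not US-ASCII characters were escaped.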

    The work of the IDN WG currently includes some procedures for name
    preparation [Nameprep].  Before encoding an internationalized domain
    name in an URI, this preparation step SHOULD be applied.  However,
    the URI resolver MUST also apply any steps required as part of domain
    name resolution by [IDNA].

3. Security considerations

    The security considerations of [RFC2396] and those applying to
    internationalized domain names apply.  There may be an increased
    potential to smuggle escaped US-ASCII-based domain names across
    firewalls, although because of the uniform syntax principle for URIs,
    such a potential already exists.

4. Change Log

4.1 Changes from draft-ietf-idn-uri-01 to draft-ietf-idn-uri-02

    Moved change log to back

    Changed to only change URIs; IRI syntax updated directly in IRI
    draft.

    Removed syntax restriction on %hh in US-ASCII part, but made
    clear that restrictions to domain names apply.

    Made clear that escaped domain names in URIs should only be an
    intermediate representation.

    Gave example of mailto: as already allowing escaped host names.

4.2 Changes from draft-ietf-idn-uri-00 to draft-ietf-idn-uri-01

    Changed requirement for URI/IRI resolvers from MUST to SHOULD

    Changed IRI syntax slightly (ichar -> idchar, based on changes in
    [IRI])

    Various wording changes

References

    [IDNA]      Faltstrom, P., Hoffman, P. and A. Costello,
                "Internationalizing Domain Names in Applications (IDNA)",
                draft-ietf-idn-idna-09.txt (work in progress), May 2002,
                <http://www.ietf.org/internet-drafts/draft-ietf-idn-idna-
                09.txt>.

    [IDNWG]     "IETF Internationalized Domain Name (idn) Working Group".

    [IRI]       Duerst, M. and M. Suignard, "Internationalized Resource
                Identifiers (IRI)", draft-duerst-iri-01 (work in
                progress), July 2002.

    [ISO10646]  International Organization for Standardization,
                "Information Technology - Universal Multiple-Octet Coded
                Character Set (UCS) - Part 1: Architecture and Basic
                Multilingual Plane", ISO Standard 10646-1, October 2000.

    [Nameprep]  Hoffman, P. and M. Blanchet, "Nameprep: A Stringprep
                Profile for Internationalized Domain Names",
                draft-ietf-idn-nameprep-10.txt (work in progress), May
                2002, <http://www.ietf.org/internet-drafts/
                draft-ietf-idn-nameprep-10.txt>.

    [RFC2119]   Bradner, S., "Key words for use in RFCs to Indicate
                Requirement Levels", BCP 14, RFC 2119, March 1997.

    [RFC2141]   Moats, R., "URN Syntax", RFC 2141, May 1997.

    [RFC2192]   Newman, C., "IMAP URL Scheme", RFC 2192, September 1997.

    [RFC2277]   Alvestrand, H., "IETF Policy on Character Sets and
                Languages", BCP 18, RFC 2277, January 1998.

    [RFC2279]   Yergeau, F., "UTF-8, a transformation format of ISO
                10646", RFC 2279, January 1998.

    [RFC2384]   Gellens, R., "POP URL Scheme", RFC 2384, August 1998.

    [RFC2396]   Berners-Lee, T., Fielding, R. and L. Masinter, "Uniform
                Resource Identifiers (URI): Generic Syntax", RFC 2396,
                August 1998.

    [RFC2640]   Curtin, B., "Internationalization of the File Transfer
                Protocol", RFC 2640, July 1999.

    [RFC2718]   Masinter, L., Alvestrand, H., Zigmond, D. and R. Petke,
                "Guidelines for new URL Schemes", RFC 2718, November
                1999.

    [RFC2732]   Hinden, R., Carpenter, B. and L. Masinter, "Format for
                Literal IPv6 Addresses in URL's", RFC 2732, December
                1999.

Author's Address

    Martin Duerst
    W3C/Keio University
    5322 Endo
    Fujisawa  252-8520
    Japan

    Phone: +81 466 49 1170
    Fax:   +81 466 49 1171
    EMail: duerst@w3.org
    URI:   http://www.w3.org/People/D%C3%BCrst/

Full Copyright Statement

    Copyright (C) The Internet Society (2002).  All Rights Reserved.

    This document and translations of it may be copied and furnished to
    others, and derivative works that comment on or otherwise explain it
    or assist in its implementation may be prepared, copied, published
    and distributed, in whole or in part, without restriction of any
    kind, provided that the above copyright notice and this paragraph are
    included on all such copies and derivative works.  However, this
    document itself may not be modified in any way, such as by removing
    the copyright notice or references to the Internet Society or other
    Internet organizations, except as needed for the purpose of
    developing Internet standards in which case the procedures for
    copyrights defined in the Internet Standards process must be
    followed, or as required to translate it into languages other than
    English.

    The limited permissions granted above are perpetual and will not be
    revoked by the Internet Society or its successors or assigns.

    This document and the information contained herein is provided on an
    "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING
    TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING
    BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION
    HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF
    MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Acknowledgement

    Funding for the RFC Editor function is currently provided by the
    Internet Society.