Network Working Group                                         J. Klensin
Internet-Draft                                            April 30, 2004
Expires: October 29, 2004

  Registration of Internationalized Domain Names: Overview and Method

Status of this Memo

   This document is an Internet-Draft and is in full conformance with
   all provisions of Section 10 of RFC2026.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups. Note that other
   groups may also distribute working documents as Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at http://

   The list of Internet-Draft Shadow Directories can be accessed at

   This Internet-Draft will expire on October 29, 2004.

Copyright Notice

   Copyright (C) The Internet Society (2004). All Rights Reserved.


Abstract

   The IETF has introduced standards-track mechanisms to enable the use of
   "internationalized", i.e., non-ASCII, names in the DNS and
   applications that use it.  This has led, in turn, to concerns that
   characters with similar meanings or appearances could cause user
   confusion and opportunities for deliberate deception and fraud.  Part
   of this problem can be addressed by limiting, on a per-zone (or
   per-registry) basis, the specific characters that can be used to be a
   subset of the list allowed by the standard and by creating
   "reservations" of labels that might create confusion with those that
   are permitted.  The model for doing this for languages that use
   characters that originated with Chinese has been extensively
   developed in another document.  This document discusses some of the
   issues in that design and relates them to considerations and

Klensin                 Expires October 29, 2004                [Page 1]

Internet-Draft              IDN Registration                  April 2004

   mechanisms that might be appropriate for other languages and scripts,
   especially those involving alphabetic characters.

   In particular, it describes some suggested practices for registering
   internationalized domain names (IDNs) in a zone. Before accepting
   such registrations of domain names, the zone's registry should decide
   which codepoints in the Unicode character set the zone will accept.
   The registry should also decide whether particular characters in a
   registered domain name should cause action with regard to other
   domain names which are considered equivalent; these domain names
   might be added to the zone or blocked from registration. This
   document also describes the concept of character variants for
   registering IDNs, how they might be handled in the registration
   process, and how to publish tables that list the character variants.

   This document is intended to supply a basis for adapting methods
   developed for Chinese, Japanese, and Korean to other languages and
   scripts.  If these adaptations are made carefully and with due
   consideration for local issues, the likelihood of problematic DNS
   registrations will be significantly reduced.  A specific method is
   introduced that should be applicable, directly or with minor
   modifications, to many scripts.

Table of Contents

   1.  Introduction
     1.1   Background
     1.2   Terminology
       1.2.1   Characters, Variants, Registrations, and Other Issues
       1.2.2   Confusion, Fraud, and Cybersquatting
     1.3   A Review of the JET Guidelines
       1.3.1   JET Model
       1.3.2   Reserved Names and Label Packages
     1.4   Languages, Scripts, and Variants
       1.4.1   Languages and Scripts
       1.4.2   Variant Selection
     1.5   Reservations and Exclusions
       1.5.1   Sequence Exclusions for Valid Characters
       1.5.2   Character Pairing Issues
     1.6   The Registration Bundle
       1.6.1   Definitions and Structure
       1.6.2   Application of the Registration Bundle
   2.  Some Implications of This Approach
   3.  Required Modifications to JET Model Needed Under Some of
       the Models Above
   4.  Conclusions and Recommendations About the General Approach
   5.  A Model Table Format
   6.  A Model Label Registration Procedure: "CreateBundle"
     6.1   Description of the CreateBundle Mechanism
   7.  Security Considerations
   8.  Acknowledgements
   9.  References
       Author's Address
       Intellectual Property and Copyright Statements

1.  Introduction

1.1  Background

   Once work on the basic model for encoding non-ASCII strings in the
   DNS with IDNA ([RFC3490], [RFC3491], [RFC3492]) was nearing
   completion, it became clear that it would be desirable for registries
   to impose additional restrictions on the names that could actually be
   registered (e.g., see [IESG-IDN] and [ICANN-IDN]) as a means of
   reducing potential confusion among characters that were similar in
   some way.  These restrictions were, in many respects, part of a long
   tradition.  For example, while the original DNS specifications
   [RFC1035] permitted any string of octets to be used in a DNS label,
   they also recommended the use of a much more restricted subset, one
   that was derived from the much older "hostname" rules [RFC0952] and
   defined by the "LDH" convention (for the three permitted types of
   characters, letters, digits, and the hyphen). Enforcement of those
   restricted rules in registrations was the responsibility of the
   registry or domain administrator.  They were not embedded in the DNS
   protocol itself, although some application protocols, notably those
   concerned with electronic mail, did impose and then enforce similar
   rules.
   If there are no constraints on registration in a zone, people can
   register names containing characters that increase the risk of
   misunderstandings, cybersquatting, and other forms of confusion.
   That a similar situation existed even before the introduction of IDNA
   is exemplified by domain names such as example.com and examp1e.com
   (note that the latter domain contains the digit "1" instead of the
   letter "l").

   For non-ASCII names (so-called "internationalized domain names" or
   "IDNs"), the problem was more complicated than that which led to the
   LDH (hostname) rules.  In the earlier situation, all protocols,
   hosts, and DNS zones used ASCII exclusively in practice, so the LDH
   restriction could reasonably be applied uniformly across the
   Internet. With the introduction of a very large character repertoire,
   and with different geographical and political locations and languages
   having requirements for different collections of characters, the
   optimal registration restrictions became not a global matter but one
   that differed among areas and, hence, among DNS zones.

   For some human languages, there are characters and/or strings that
   have equivalent or near-equivalent usages. If someone is allowed to
   register a name with such a character or string, the registry might
   want to automatically associate all of the names that have the same
   meaning with the registered name. The registry might also decide
   whether the names that are associated with, or generated by, one
   registration should, as a group or individually, go into the zone or
   be blocked from registration by different parties.

   To date, the best-developed system for handling registration
   restrictions for IDNs is the JET Guidelines for Chinese, Japanese,
   and Korean [RFC3743], the so-called "CJK" languages.  That system is
   limited to those languages and, in particular, to their common script
   base.   This document explores the principles behind those guidelines
   and some of the issues that might arise in trying to adapt them to
   alphabetic languages.

   This document describes five things:

   1.  The general background and considerations for non-ASCII scripts
       in names.  Just as the JET Guidelines contain some suggestions
       that may not be applicable to alphabetic scripts, some of the
       suggestions here, especially the more specific ones, may be
       applicable to some scripts and not others.
   2.  Suggested practices for describing character variants
   3.  A method for using a zone's character variants to determine which
       names should be associated with a registration
   4.  A format for publishing a zone's table of character variants
   5.  A model algorithm for name registration given the presence of
       language tables.

1.2  Terminology

1.2.1  Characters, Variants, Registrations, and Other Issues

   1.  Characters in this document are given as their Unicode codepoints
       in U+xxxx format, with their official names, or both.

   2.  The following terms are used in this document.

       1.   A "string" is a sequence of one or more characters.
       2.   This document discusses characters that may have equivalent
            or near-equivalent characters or strings. The "base
            character" is the character that has zero or more
            equivalents.  In the JET Guidelines, base characters are
            referred to as "valid characters".
       3.   The "variant(s)" are the character(s) and/or string(s) that
            are equivalent to the base character.  Note that these might
            not be true equivalent characters: a base character might
            have a mapping to a particular variant character, but that
            variant character may not have a mapping to the base
            character.  Usually, characters or strings to be designated
            as variants are considered either equivalent or sufficiently
            similar (by some registry-specific definition) that
            confusion between them and the base character might occur.
       4.   The "base registration" is the single name that the
            registrant requested from the registry.
       5.   A label (or "name") is described as "registered" if it is
            actually entered into a domain (i.e., a zone file) by the
            registry, so that it can be accessed and resolved using
            standard DNS tools.  The JET Guidelines describe a
            "registered" label as "activated".
       6.   A "registration bundle" is the set of all labels that comes
            from expanding the base characters for a single name into
            their variants.  The presence of a label in a registration
            bundle does not imply that it is registered.  In the JET
            Guidelines, a registration bundle is called an "IDN
            package".
       7.   A "reserved label" is a label in a registration bundle that
            is not actually registered.
       8.   A "registry" is the administrative authority for a DNS zone.
            That is, the registry is the body that enforces, and
            typically makes, policies that are used in a particular zone
            in the DNS.
       9.   "Coded Character Set" ("CCS") is a term for a list of
            characters and the code positions assigned to them.  ASCII
            and Unicode are CCSs.
       10.  A "language" is something spoken by humans, independent of
            how it is written or coded.  ISO Standard 639 and IETF BCP
            47 (RFC 3066) [RFC3066] list and define codes for
            identifying languages.
       11.  A "script" is a collection of characters (glyphs,
            independent of coding) that are used together, typically to
            represent one or more languages. Note that the script for
            one language may heavily overlap the script for another.
            This does not imply that they have identical scripts.
       12.  "Charset" is an IETF-invented term to describe, more or
            less, the combination of a script, a CCS that encodes that
            script, and rules for serializing the bytes when those are
            stored on a computer or transmitted over the network.

   The last four of these definitions are redundant with, but
   deliberately somewhat less precise than, the definitions in
   [RFC3536], which also provides sources.  The two sets of definitions
   are intended to be consistent.
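
   The relationship among base characters, variants, the registration
   bundle, and reserved labels can be sketched in a few lines.  This is
   a minimal illustration under stated assumptions, not a procedure
   defined by this document: the table VARIANTS and the function names
   are hypothetical, and only single-character variants are shown.

```python
from itertools import product

# Hypothetical variant table: base character -> set of variants.
# Mappings need not be symmetric, so each direction is listed.
VARIANTS = {
    "l": {"1"},
    "1": {"l"},
}


def registration_bundle(base_registration, variants=VARIANTS):
    """Expand every character into itself plus its variants and
    return the set of all resulting labels."""
    choices = [[ch, *sorted(variants.get(ch, ()))]
               for ch in base_registration]
    return {"".join(combo) for combo in product(*choices)}


def reserved_labels(bundle, registered):
    """Labels in the bundle that are not actually registered."""
    return set(bundle) - set(registered)
```

   With this table, the base registration "pale" yields a bundle of two
   labels, and a label containing five occurrences of "l" yields
   2^5 = 32 labels, which shows how quickly bundles grow.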

1.2.2  Confusion, Fraud, and Cybersquatting

   The term "confusion" is used very generically in this document to
   cover the entire range from accidental user misperception of the
   relationship between characters with some characteristic in common
   (typically appearance, sound, or meaning) to cybersquatting and
   other deliberately fraudulent attempts to exploit those relationships
   or others based on the nature of the characters.

1.3  A Review of the JET Guidelines

1.3.1  JET Model

   In the JET Guidelines model, a prospective registrant approaches the
   registry for a zone (perhaps through an intermediate registrar) with
   a candidate base registration -- a proposed name to be registered --
   and a list of languages in which that name is to be interpreted.  The
   languages are defined according to the fairly high-resolution coding
   of [RFC3066] -- Chinese as used on the mainland of the People's
   Republic of China ("zh-cn") can, at registry option, consist of a
   somewhat different list of characters (code points) and be
   represented by a separate table compared to Chinese as used in Taiwan

   The design of the JET Guidelines took one important constraint as a
   basis: IDNA was treated as a firm standard.  A procedure that
   modified some portion of the IDNA functions, or was a variant on
   them, was considered a violation of those standards and should not be
   encouraged (or, probably, even permitted).

   Each registry is expected to construct (or obtain) a table for each
   language it considers relevant and appropriate.  These tables list,
   for the particular zone, the characters permitted for that language.
   If a character does not appear as a "valid code point" (called a
   "base character" in the rest of this document) in that table, then a
   name containing it cannot be registered.  If multiple languages are
   listed for the registration, then the character must appear in the
   tables for each of those languages.
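
   The validity rule just described can be sketched as follows.  The
   tables shown are tiny, hypothetical stand-ins and do not reproduce
   any registry's actual language tables; the language tags used as
   keys and the function names are likewise assumptions made for
   illustration.

```python
# Hypothetical per-language tables of valid code points.
TABLES = {
    "zh-cn": {"\u4e2d", "\u56fd"},
    "zh-tw": {"\u4e2d", "\u570b"},
}


def invalid_characters(label, languages, tables=TABLES):
    """Characters of the label that are missing from the table of at
    least one of the listed languages."""
    return [ch for ch in label
            if any(ch not in tables[lang] for lang in languages)]


def may_register(label, languages, tables=TABLES):
    """A label qualifies only if every character is valid in every
    language listed for the registration."""
    return bool(languages) and not invalid_characters(
        label, languages, tables)
```

   Listing more languages can only shrink the set of acceptable
   characters, since each character must appear in every table.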

   The tables may also contain columns that specify alternate or variant
   forms of the valid character.  If these variants appear, they are
   used to synthesize labels that are alternatives to the original one.
   These labels are all reserved and can be registered or "activated"
   (placed into the DNS) only by the action or request of the original
   registrant; some (the "preferred variant labels") are typically
   registered automatically.  The zone is expected to establish
   appropriate policies for situations in which the variant forms of one
   label conflict with already-reserved or already-registered labels.

   Most of these concepts were introduced because of concerns about
   specific issues with CJK characters, beginning from the requirement
   that the use of Simplified Chinese by some registrants and
   Traditional Chinese by others not be permitted to create confusion or
   opportunities for fraud.  While they may be applicable to registry
   tables constructed for alphabetic scripts, the translation should be
   done with care, since many analogies are not exact.

   Some of the important issues are discussed in the sections that
   follow. The JET model may be considered as a specialized variation on
   the model and method presented by the rest of this document.  Other
   languages or scripts may require other variations.

1.3.2  Reserved Names and Label Packages

   A basic assumption of the JET model is that, if the evolution of
   specific characters or the properties of Unicode ([Unicode],
   [Unicode32]) or IDNA cause two strings to appear similar enough to
   cause confusion, either or both should be registered by the same
   party or one of them should become unregisterable.  The definition of
   "appear similar enough" will differ for different cultures and
   circumstances -- and hence DNS zones -- but the principle is fairly
   general.  In the JET model, all of the "variant" strings are
   identified, some are registered into the DNS automatically, and
   others are simply reserved and can be registered, if at all, only by
   the original registrant.  Other zones might find other policies
   appropriate.  For example, a zone might conclude that having similar
   strings registered in the DNS was undesirable.  If so, the list of
   variant labels would be used only to build a list of names that would
   be reserved and prohibited from being registered.

1.4  Languages, Scripts, and Variants

1.4.1  Languages and Scripts

   Conversations about scripts -- collections of characters associated
   with particular languages -- are common when discussing character
   sets and codes.   But the boundaries between one script and another
   are not well-defined.  The Unicode Standard ([Unicode][Unicode32]),
   for example, does not define them at all, even though it is
   structured in terms of usually-related blocks of characters.  The
   issue is complicated by the common origin of most alphabetic scripts
   (see, for example, [Drucker]). Because of that history, certain
   character-symbols appear in the scripts associated with multiple
   languages, sometimes with very different sounds or meanings.  This
   differs from the CJK situation in which, if a character appears in
   more than one of the relevant languages, it will almost always have
   the same interpretation in each one. For the subset of characters
   that actually are ideographs or pictographs, pronunciation is
   expected to vary widely while meaning is preserved. At least in part
   because of that similarity of meaning, it made sense in the JET case
   to permit a registration to specify multiple languages, to verify that
   the characters in the label string were valid for each, and then to
   generate variant labels using each language in turn.  For many
   alphabetic languages, it may be more sensible to prohibit the label
   string submitted for registration from being associated with more
   than one language.  Indeed, "one label, one language" has been
   suggested as an important barrier against common sources of
   "look-alike" confusion. For example, the imposition of that rule in a
   zone would prevent the insertion of a few Greek or Cyrillic
   characters with shapes identical to the Latin ones into what was
   otherwise a Latin-based string.  For a particular table, the list of
   valid characters may be thought of as the script associated with the
   relevant language, with the understanding that the table design does
   not prevent the same character from appearing in the tables for
   multiple languages.
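
   A rough sketch of how a zone might detect the mixing described above
   follows.  It is only an approximation: it tests scripts rather than
   languages, and it infers the script from the first word of each
   character's Unicode name, because the Python standard library does
   not expose the Unicode script property.  The function names are
   hypothetical.

```python
import unicodedata


def scripts_in(label):
    """Approximate the set of scripts used by the letters of a label,
    using the first word of each letter's Unicode character name."""
    found = set()
    for ch in label:
        if ch.isalpha():
            found.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return found


def is_single_script(label):
    """True if all letters appear to come from one script."""
    return len(scripts_in(label)) <= 1
```

   For example, a label in which CYRILLIC SMALL LETTER A (U+0430)
   replaces one Latin "a" is reported as using two scripts, although
   the two letters are visually indistinguishable in many fonts.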

   Indeed, this notion of a locally, and specifically-identified, script
   can be turned around: while the tables are referred to as "language
   tables", they are associated with languages only insofar as thinking
   about the character structure and word forms associated with a given
   language helps to inform the construction of a table.  A country like
   Finland, for example, might select among

   o  One table each for Finnish, Swedish, and English characters and
      conventions, permitting a string to be registered in one, two, or
      all three languages (although a three-language registration would
      presumably prohibit any characters that did not appear in all
      three languages).
   o  One table each, but with a "one label, one language" rule for the
      zone.
   o  A combined table based on the observation that all three writing
      systems were based on Roman characters and that the possibilities
      for confusion that were of interest to the registry would not be
      reduced by "language" differentiation.

   Regardless of what decisions were made about those languages and
   scripts, if they also decided to permit registrations of labels
   containing Cyrillic characters, they might have a separate table for
   them.  That table might contain some Roman-derived characters (either
   as base characters or as variants) just as some CJK tables do.  See
   also Section 2, below.

   As the JET Guidelines stress, no tables or systems of this type --
   even if identified with a language as a means of defining or
   describing the table -- can assure linguistic or even syntactic
   correctness of labels with regard to that language. That level of
   assurance may not be possible without human intervention or at least
   dictionary lookups of complete proposed labels.   It may even not be
   desirable to attempt that level of correctness (see Section 2).

   Of course, if any language-based tests or constraints, including "one
   label, one language", are to be applied to limit the associated
   sources of confusion, each zone must have a table for each language
   in which it expects to accept registrations; the notion of a single
   combined table for the zone is, in the general case, simply
   unworkable.  One could use a single table for the zone if the intent
   were to impose only minimal restrictions, e.g., to force alphabetic
   and numeric characters only and exclude symbols and punctuation.
   That type of restriction might be useful in eliminating some
   problems, such as those of unreadable labels, but would be unlikely
   to be very helpful with, e.g., confusion caused by similar-looking
   characters.

1.4.2  Variant Selection

   The area of character variants is rife with difficulties (and perhaps
   opportunities). There is no universal agreement about which base
   characters have variants, or if they do, what those variants are. For
   example, in some regions of the world and in some languages, LATIN
   SMALL LETTER O WITH DIAERESIS (U+00F6) and LATIN SMALL LETTER O WITH
   STROKE (U+00F8) are variants of each other, while in other regions,
   most people would think that LATIN SMALL LETTER O WITH STROKE has no
   variants.  In some cases, the list of variants is difficult to
   enumerate. For example, it required several years for the Chinese
   language community to create variant tables for use with IDNA, and it
   remains, at the time of this writing, questionable how widely those
   tables will be accepted among users of Chinese from areas of the
   world other than those represented by the groups that created them.

   Thus, the first thing a registry should ask is whether or not any of
   the characters that they want to permit to be used have variants. If
   not, the registry's work is much simpler. This is not to say that a
   registry should ignore variants if they exist: adding variants after
   a registry has started to take registrations will be nearly as
   difficult administratively as removing characters from the list of
   acceptable characters. That is, if a registry later decides that two
   characters are variants of each other, and there are actively-used
   names in the zones that differ only on the new variants, the registry
   might have to transfer ownership of one of the names to a different
   owner, using some process that is certain to be controversial.

   This situation is likely to be much easier for areas and zones that
   use characters that previously did not occur in the DNS at all than
   it will be for zones in which non-English labels have been registered
   in ASCII characters for some time.    In the former case, the rules
   and conventions can be established before any registrations occur.
   In the latter, there may be conflicts or opportunities for confusion
   between existing registrations and now-permitted Roman-based
   characters that do not appear in ASCII.  For example, a domain name
   might exist today that uses the name of a city in Canada spelled as
   "Montreal".  If the zone in which it occurs changes its rules to
   permit the use of the character LATIN SMALL LETTER E WITH ACUTE
   (U+00E9), does the name of the city, spelled (correctly) using that
   character, conflict with the existing domain name registration?
   Certainly, if both are permitted, and permitted to be registered by
   separate parties, there are many opportunities for confusion.
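
   The "Montreal" question can be made concrete with a short sketch.
   To the DNS the two spellings are simply different labels, as the
   ASCII-compatible encoding produced by Python's built-in "idna" codec
   shows.  Whether they conflict is a registry policy decision.  The
   comparison below, which removes combining marks after canonical
   decomposition, is one possible policy assumed for illustration and
   not a recommendation; the function names are hypothetical.

```python
import unicodedata


def strip_marks(label):
    """Remove combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", label)
    return "".join(ch for ch in decomposed
                   if not unicodedata.combining(ch))


def conflicts_with_existing(candidate, existing):
    """Assumed policy: two labels conflict if they are identical once
    combining marks are removed."""
    return strip_marks(candidate) in {strip_marks(n) for n in existing}


# The accented spelling becomes a distinct "xn--" label in the DNS.
ace = "montr\u00e9al".encode("idna")
```

   Under this assumed policy the accented spelling would be treated as
   conflicting with an existing "montreal" registration, so a registry
   could reserve it for the existing registrant instead of offering it
   to a new party.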

   Of course, zone managers should inform all current registrants when
   the registration policy for the zone changes. This includes the time
   at which IDN characters are first allowed in the zone, when
   additional characters are permitted later, and, if it is necessary to
   change character variant tables, when that occurs.

   In many languages there are two variants for a character, but one
   variant is strongly preferred. A registry might only allow the base
   registration in the preferred form, or it might allow any form for
   the base registration. If the variant tables are created carefully,
   the resulting bundles will be the same, but some registries will give
   special status to the base registration such as its appearance in
   "Whois" databases.

1.5  Reservations and Exclusions

1.5.1  Sequence Exclusions for Valid Characters

   The JET Guidelines are based on processing only single characters.
   Any processing of pairs or longer sequences of characters is left to
   what that document describes as "additional processing" -- procedures
   specifically permitted by the Guidelines but defined by a registry in
   addition to the variant table processing specified in the Guidelines
   themselves.  A different zone, with different needs, could use a
   modified version of the table structure, or different types of
   additional processing, to prohibit, as well as accept, particular
   sequences of characters by marking them as invalid.  Other
   modifications or extensions might be designed to prevent certain
   letters from appearing at the beginning or end of labels.  The use of
   regular expressions in the "valid characters" column might be one
   way to implement these types of restrictions.
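
   One way such regular-expression rules might look is sketched below.
   The three patterns are invented purely for illustration and are not
   drawn from the JET Guidelines or from any registry's tables; the
   names EXCLUSION_RULES, violated_rules, and is_permitted are
   hypothetical.

```python
import re

# Hypothetical exclusion rules for a zone.
EXCLUSION_RULES = [
    re.compile(r"^-|-$"),    # hyphen at the beginning or end
    re.compile(r"[0-9]$"),   # digit at the end of the label
    re.compile(r"q(?!u)"),   # "q" not followed by "u"
]


def violated_rules(label, rules=EXCLUSION_RULES):
    """Return the patterns of all rules that the label violates."""
    return [rule.pattern for rule in rules if rule.search(label)]


def is_permitted(label, rules=EXCLUSION_RULES):
    """A label is permitted if it matches no exclusion rule."""
    return not violated_rules(label, rules)
```

   Keeping the rules as data, separate from the code that applies them,
   would let them be published alongside the zone's variant table.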

   In particular, in some scripts derived from Roman characters,
   sequences that have historically been typographically represented by
   single "ligature" or "digraph" characters may also be represented by
   the separate characters (e.g., "ae" (U+00E6) or "ij" (U+0133)).  If
   it is desired to either prohibit these, or to treat them as variants,
   some extensions to the single-character JET model may be needed (as
   may be some careful thinking about IDNA (especially nameprep), since
   some of these combinations are excluded there).
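
   The difference between the two examples can be seen by applying
   Unicode normalization form KC, which is the normalization that
   nameprep specifies.  The sketch below uses the Unicode data shipped
   with the Python interpreter rather than the Unicode 3.2 data that
   nameprep is defined against, so it should be read as an
   approximation of nameprep behavior, not as nameprep itself.

```python
import unicodedata

IJ_LIGATURE = "\u0133"  # LATIN SMALL LIGATURE IJ
AE_LETTER = "\u00e6"    # LATIN SMALL LETTER AE


def nfkc(text):
    """Apply Unicode normalization form KC."""
    return unicodedata.normalize("NFKC", text)


# U+0133 has a compatibility decomposition, so it becomes "ij".
ij_result = nfkc(IJ_LIGATURE)

# U+00E6 has no decomposition, so it is left unchanged.
ae_result = nfkc(AE_LETTER)
```

   Because U+00E6 is left unchanged by normalization, a registry that
   wants to relate it to the two-character sequence "ae" would need a
   table format that allows a string, and not only a single character,
   as a variant.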

1.5.2  Character Pairing Issues

   Some character pairings -- the use of a character form (glyph) in one
   language and a different form with the same properties in a related
   one -- closely approximate the issues with mapping between
   Traditional and Simplified Chinese, although the history is different.
   For example, it might be useful to have "o" with a stroke (U+00F8) as
   a variant for "o" with diaeresis above it (U+00F6) (and the
   equivalent upper-case pair) in a Swedish table, and vice versa in a
   Norwegian one, or to prohibit one of these characters entirely in
   each table. In a German table, U+00F8 would presumably be prohibited,
   while U+00F6 might have "oe" as a variant. Obviously, if the relevant
   language of registration is unknown, this type of variant matching
   cannot be applied in any sensible way.
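
   The language dependence described above can be sketched with
   per-language variant tables.  The tables below encode only the
   examples given in this section, the language tags used as keys are
   assumptions, and the function name variants_for is hypothetical.

```python
# Hypothetical per-language variant tables: base character -> variants.
VARIANT_TABLES = {
    "sv": {"\u00f6": ["\u00f8"]},  # Swedish: o-stroke is a variant
    "no": {"\u00f8": ["\u00f6"]},  # Norwegian: the reverse mapping
    "de": {"\u00f6": ["oe"]},      # German: the string "oe"
}

# Characters a table prohibits outright (German example from the text).
PROHIBITED = {"de": {"\u00f8"}}


def variants_for(ch, language):
    """Look up the variants of a character in one language table.
    Without a known language of registration there is no table to
    consult, so the lookup is refused."""
    if language not in VARIANT_TABLES:
        raise ValueError("language of registration is required")
    if ch in PROHIBITED.get(language, ()):
        raise ValueError("character is prohibited in this table")
    return VARIANT_TABLES[language].get(ch, [])
```

   The same character, U+00F6, thus produces a different variant in
   the Swedish and German tables and none in the Norwegian one.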

1.6  The Registration Bundle

1.6.1  Definitions and Structure

   As one of its critical innovations, the JET model defines an "IDN
   package", known in this document as a "registration bundle", which
   consists of the primary registered string (which is used as the name
   of the bundle), the information about the language table(s) used, the
   variant labels for that string, and indications of which of those
   labels are registered in the relevant zone file ("activated" in the
   JET terminology).  Registration bundles are also atomic -- one cannot
   add or remove variant labels from one without unregistering the
   entire package.  A label exists in only one registration bundle at a
   time; if a new label is registered that would generate a variant that
   matches one that appears in an existing package, that variant simply
   is not included in the second package.  A subsequent deregistration
   of the first package does not cause the variant to be added to the
   second. While it might be possible to change this in other models,
   the JET conclusion was that other options would be far too complex to
   implement and operate and would cause many new types of name
   conflicts.
1.6.2  Application of the Registration Bundle

   A registry has three options for how to handle the case where the
   registration bundle is non-trivial, i.e., contains more than one
   label. The policy options are:

   1.  Register and resolve all labels in the zone, making the zone
       information identical to that of the registered label. This
       option will cause end users to be able to find names with
       variants more easily, but will result in larger zone files. For
       some language tables, the zone file could become so large that it
       could negatively affect the ability of the registry to perform
       name resolution. If the base registration contains several
       characters that have equivalents, the owner could end up having
       to take care of large numbers of zones. For instance, if DIGIT
       ONE is a variant of LATIN SMALL LETTER L, the owner of the domain
       name all-lollypops.example.com will have to manage 32 zones.   If
       the intent is to keep the contents of those zones identical, the
       owner may then face a significant administrative problem.  If
       other concerns dictate short times to live and absolute
       consistency of DNS responses, the challenges may be nearly
       impossible.
   2.  Block all labels other than the registered label so they cannot
       be registered in the future. This option does not increase the
       size of the zone file and provides maximum safety against false
       positives, but it may cause end users to not be able to find
       names with variants that they would expect. If the base
       registration contains characters that have equivalents, Internet
       users who don't know what base characters were used in the
       registration will not know what character to type in to get a DNS
       response. For instance, if DIGIT ONE is a variant of LATIN SMALL
       LETTER L, and LATIN SMALL LETTER L is a variant of DIGIT ONE, the
       user who sees "pale.example.com" will not know whether to type a
       "1" or a "l" after the "pa" in the first label.
   3.  Resolve some labels and block some other labels. This option is
       likely to cause the most confusion with users because including
       some variants will cause a name to be found, but using other
       variants will cause the name to be not found. For example, even
       if people understood that DIGIT ONE and LATIN SMALL LETTER L were
       variants, a typical DNS user wouldn't know which character to
       type because they wouldn't know whether this pair were used to
       register or block the labels. However, this option can be used to
       balance the desires of the name owner (that every possible
       attempt to enter their name will work) with the desires of the
       zone administrator (to make the zone more manageable and possibly
       to be compensated for greater amounts of work needed for a single
       registration). For many circumstances, it may be the most
       attractive option.
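
   The arithmetic behind the "32 zones" example, and the effect of the
   three policy options on what ends up in the zone, can be sketched as
   follows.  The names Policy, bundle_size, and split_bundle are
   hypothetical, as is the "activate" parameter used to pick labels
   under the third option; the sketch is in Python.

```python
from enum import Enum
from math import prod


class Policy(Enum):
    RESOLVE_ALL = 1      # option 1: register and resolve every label
    BLOCK_VARIANTS = 2   # option 2: only the registered label resolves
    MIXED = 3            # option 3: resolve some labels, block the rest


def bundle_size(label, variants):
    """Number of labels generated when each character may be replaced
    by any of its variants: the product of (1 + variant count)."""
    return prod(1 + len(variants.get(ch, ())) for ch in label)


def split_bundle(registered, bundle, policy, activate=()):
    """Return (labels placed in the zone, labels that are only blocked)."""
    if policy is Policy.RESOLVE_ALL:
        in_zone = list(bundle)
    elif policy is Policy.BLOCK_VARIANTS:
        in_zone = [registered]
    else:
        wanted = set(activate) | {registered}
        in_zone = [label for label in bundle if label in wanted]
    blocked = [label for label in bundle if label not in in_zone]
    return in_zone, blocked


variants = {"l": ["1"], "1": ["l"]}
# "all-lollypops" contains five "l" characters, so 2**5 labels result.
print(bundle_size("all-lollypops", variants))   # 32
```

   In every branch the registered label itself stays in the zone; only
   the treatment of the other labels in the bundle differs.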

   In all cases, at least the registered label should appear in the
   zone. It would be almost impossible to describe to name owners why
   the name that they asked for is not in the zone, but some other name
   that they now control is.   By implication, if the requested label is
   already registered, the entire registration request must be rejected.

2.  Some Implications of This Approach

   Historically, DNS labels were considered to be arbitrary identifier
   strings, without any inherent meaning.  Even in ASCII, there was no
   requirement that labels form words.  Labels that could not possibly
   represent words in any Romance or Germanic language have actually
   been quite common.  In general, in those languages, words contain at
   least one vowel and do not have embedded numbers. The more one moves
   toward "language"-based registry restrictions, the less it is going
   to be possible to construct labels out of fanciful strings. While
   such strings are terrible candidates for "words", they may make very
   good identifiers.  To take a trivial example using only ASCII
   characters, "rtr32w", "rtr32x", and "rtr32z" might be very good DNS
   labels for a particular zone and application, but, given the embedded
   digits and lack of vowels, would fail even the most superficial of
   tests for valid English word forms.  It is worth noting that several
   DNS experts have suggested that a number of problems could be solved
   by prohibiting meaningful names in labels, requiring instead that the
   labels be random or nonsense strings.  If methods similar to those
   discussed in this document were used to force identifiers to be
   closer to meaningful words in real languages, the result would be
   directly contradictory to those "random name" approaches.
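
   To make the "superficial test" concrete, a deliberately crude check
   of the kind alluded to above might look like the following Python
   sketch.  The function name and the decision to count "y" as a vowel
   are assumptions made for illustration only; no registry is known to
   use this exact test.

```python
VOWELS = set("aeiouy")   # assumption: "y" is counted as a vowel


def looks_like_word(label):
    """Superficial test: at least one vowel and no digits."""
    lowered = label.lower()
    has_vowel = any(ch in VOWELS for ch in lowered)
    has_digit = any(ch.isdigit() for ch in lowered)
    return has_vowel and not has_digit
```

   All three of "rtr32w", "rtr32x", and "rtr32z" fail this check even
   though they may be perfectly good identifiers.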

   Interestingly, if one were trying to develop an "only words" system,
   a rather different -- but very restrictive -- model could be
   developed using lookups in a dictionary for the relevant language and
   a listing of valid business names for the relevant area.   If a
   string did not appear in either, it would not be permitted to be
   registered.  Models effectively equivalent to this one have
   historically been used to restrict registrations in some country-code
   top level domains. On the other hand, if look-alike characters are a
   concern, even that type of rule (or restriction) would still not
   avoid the need to consider character variants.

   Consequently, registries applying the principles outlined in this
   document should be careful not to apply more severe restrictions than
   are reasonable and appropriate while, at the same time, being aware
   of how difficult it usually is to add restrictions at a later time.

3.  Required Modifications to JET Model Needed Under Some of the Models

   The JET model was designed for CJK characters.  The discussion above
   implies that some extensions to it may be needed to handle the
   characteristics of various alphabetic scripts and the decisions that
   might be made about them in different zones.  Those extensions might
   include facilities to process:

   o  Two-character (or more) sequences, such as ligatures and
      typographic spelling conventions, as variants.
   o  Regular expressions or some other mechanism for dealing with
      string positions of characters (e.g., characters that must, or
      must not, appear at the beginning or end of strings).
   o  Delimiter breaks to permit multiple languages to be used,
      separately, within the same label.  E.g., is it possible to define
      a label as consisting of two or more sublabels, each in a
      different language,  with some particular delimiter used to define
      the boundaries of the sublabels?
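
   The second item in the list above (positional constraints) could be
   expressed with regular expressions along the following lines.  The
   rule set shown, which bars a hyphen at either end of a label, is
   only an illustration in the spirit of the traditional hostname
   rules; the names POSITION_RULES and violated_rules are hypothetical.

```python
import re

# Each rule is a pattern that a conforming label must match somewhere.
POSITION_RULES = {
    "no leading hyphen": re.compile(r"^(?!-)"),
    "no trailing hyphen": re.compile(r"(?<!-)\Z"),
}


def violated_rules(label):
    """Return the names of the positional rules that the label breaks."""
    return [name for name, pattern in POSITION_RULES.items()
            if not pattern.search(label)]
```

   A table-driven form such as this would let each language table carry
   its own positional rules without changing the processing code.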

4.  Conclusions and Recommendations About the General Approach

   After examining the implications of the potential use of the full
   range of characters permitted by IDNA in DNS labels, multiple groups
   have concluded that some restrictions are needed to prevent many
   forms of user confusion about the actual structure of a name or the
   word, phrase, or term that it appears to spell out.  The best way to
   approach such restrictions appears to be to draw on the language and
   culture of the community of registrants and users in the relevant
   zone: if particular characters are likely to be surprising or
   unintelligible to both of those groups, it is probably wise to not
   permit them to be used in registrations. Registration restrictions
   can be carried much further than restricting permitted characters to
   a selected Unicode subset.  The idea of a reserved "bundle" of
   related labels permits probably-confusing combinations or sets of
   characters to be bound together, under the control of a single
   registrant.  While that registrant might still use the package in a
   way that confused his or her own users (the approach outlined here
   will not prevent either ill-thought-out ideas or stupidity), the
   possibility of turning potential confusion into a hostile attack
   would be considerably reduced.

   At the same time, excessive restrictions may make DNS identifiers
   less useful for their original, intended, purpose: identifying
   particular hosts and similar resources on the network in an orderly
   way. Registries creating rules and policies about what can be
   registered in particular zones -- whether those are based on the JET
   Guidelines or the suggestions in this document -- should balance the
   need for restrictions against the need for flexibility in
   constructing identifiers.

   The discussion above provides many options that could be selected,
   defined, and applied in different ways in different registries
   (zones).  Registrars would almost certainly prefer systems in which
   they can predict, at least to a first order approximation, the
   implications of a particular potential registration to ones in which
   they cannot.  Predictability of that sort probably requires more
   standards, and less flexibility, than the model itself might suggest.

5.  A Model Table Format

   The format of the table is meant to be machine-readable but not
   human-readable. It is fairly trivial to convert the table into one
   that can be read by people.

   Each character in the table is given in the "U+" notation for Unicode
   characters. The lines of the table are terminated with either a
   carriage return character (ASCII 0x0D), a linefeed character (ASCII
   0x0A), or a sequence of carriage return followed by linefeed (ASCII
   0x0D 0x0A). The order of the lines in the table may or may not
   matter, depending on how the table is constructed.

   Comment lines in the table are preceded with a "#" character (ASCII
   0x23).

   Each non-comment line in the table starts with the character that is
   allowed in the registry and expected to be used in registrations,
   which is also called the "base character". If the base character has
   any variants, the base character is followed by a vertical bar
   character ("|", ASCII 0x7C) and the variant string. If the base
   character has more than one variant, the variants are separated by a
   colon (":", ASCII 0x3A). Strings are given with a hyphen ("-", ASCII
   0x2D) between each character. Comments begin with a "#" (ASCII
   0x23) and may be preceded by spaces (" ", ASCII 0x20).
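
   As an illustration of how a table line could be converted into a
   form that people can read, the following Python sketch replaces each
   "U+" value with its Unicode character name.  The function names and
   the " -> ", "; ", and " + " separators in the output are choices made
   for this example only.

```python
import unicodedata


def describe(field):
    """Turn 'U+003A-U+003A' into 'COLON + COLON'."""
    names = []
    for token in field.split("-"):
        code_point = int(token.strip()[2:], 16)
        # Fall back to the U+ form when the character has no name.
        names.append(unicodedata.name(chr(code_point),
                                      f"U+{code_point:04X}"))
    return " + ".join(names)


def humanize(line):
    """Render one table line; comment-only and blank lines give ''."""
    body = line.split("#", 1)[0].strip()
    if not body:
        return ""
    base, _, rest = body.partition("|")
    text = describe(base)
    if rest:
        text += " -> " + "; ".join(describe(v) for v in rest.split(":"))
    return text
```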

   The following is an example of how a table might look. The entries in
   this table are purposely silly and should not be used by any registry
   as the basis for choosing variants. For the example, assume that the
   registry:

   o  allows the FOR ALL character (U+2200) with no variants
   o  allows the COMPLEMENT character (U+2201) which has a single
      variant of LATIN CAPITAL LETTER C (U+0043)
   o  allows the PROPORTION character (U+2237) which has one variant
      which is the string COLON (U+003A) COLON (U+003A)
   o  allows the PARTIAL DIFFERENTIAL character (U+2202) which has two
      variants: LATIN SMALL LETTER D (U+0064) and GREEK SMALL LETTER
      DELTA (U+03B4)

   The table contents (after any required header information, see
   [IANA-language-registry]) would look like:
   # An example of a table
   U+2200
   U+2201|U+0043
   U+2237|U+003A-U+003A # Note that the variant is a string
   U+2202|U+0064:U+03B4 # Two variants for the same character

   Implementers of table processors should remember that there are tens
   of thousands of characters whose codepoints are greater than 0xFFFF.
   Thus, any program that assumes that each character in the table is
   represented in exactly six octets ("U", "+", and four octets
   representing the character value) will fail with tables that use
   characters whose value is greater than 0xFFFF.
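
   A minimal table reader that avoids the fixed-width assumption might
   look like the following Python sketch.  It accepts any number of
   hexadecimal digits after "U+".  The function names and the choice of
   a dictionary mapping each base character to a list of variant
   strings are assumptions of this example, not part of the format.

```python
def parse_code_points(field):
    """Turn 'U+003A-U+003A' into the string '::'."""
    chars = []
    for token in field.split("-"):
        token = token.strip()
        if not token.upper().startswith("U+"):
            raise ValueError(f"bad code point: {token!r}")
        chars.append(chr(int(token[2:], 16)))   # any number of hex digits
    return "".join(chars)


def parse_table(text):
    """Map each base character to its list of variant strings."""
    table = {}
    for raw in text.splitlines():               # accepts CR, LF, and CRLF
        line = raw.split("#", 1)[0].strip()     # drop comments and padding
        if not line:
            continue
        base_field, _, variant_field = line.partition("|")
        base = parse_code_points(base_field)
        if len(base) != 1:
            raise ValueError(f"base must be one character: {raw!r}")
        variants = []
        if variant_field:
            variants = [parse_code_points(v)
                        for v in variant_field.split(":")]
        table[base] = variants
    return table
```

   Because characters are always written in "U+" notation, the colon
   used as a variant separator cannot be confused with a COLON
   character that appears inside a variant string.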

6.  A Model Label Registration Procedure: "CreateBundle"

   This procedure has three inputs:

   1.  the proposed base registration
   2.  the language for the proposed base registration
   3.  the processing table associated with that language

   The output of the process is either failure (the base registration
   cannot be registered at all), or a registration bundle that contains
   one or more labels (always including the base registration). As
   described earlier, the registration bundle should be stored with its
   date of creation so that issues with overlapping elements between
   bundles can later be resolved on a first-come, first-served basis.

   There are two steps to processing the registration:

   1.  Check whether the proposed base registration exists in any
       bundle. If it does, stop immediately with a failure.
   2.  Process the base registration with the mechanism described below.

   Note that the process must be executed only once. The process must
   not be performed on any output of the process, only on the proposed
   base registration.

6.1  Description of the CreateBundle Mechanism

   The CreateBundle mechanism determines whether a registration bundle
   can be created and, if so, populates that bundle with valid labels.

   During the processing, a "temporary bundle" contains partial labels,
   that is, labels that are being built and are not complete labels. The
   partial labels in the temporary bundle consist of strings.

   The steps are:

   1.  Split the base registration into individual characters, called
       "candidate characters". Compare every candidate character against
       the base characters in the table. If any candidate character does
       not exist in the set of base characters, the system must stop and
       not register any names (that is, it must not register either the
       base registration or any labels that would have come from
       character variants).
   2.  Perform the steps in IDNA's ToASCII sequence for the base
       registration. If ToASCII fails for the base registration, the
       system must stop and not register any label (that is, it must not
       register either the base registration or labels that might have
       been created from variants of characters contained in it). If
       ToASCII succeeds, place the base registration into the
       registration bundle.
   3.  For every candidate character in the base registration, do the
       following:
       1.  Create the set of characters that consists of the candidate
           character and any variants.
       2.  For each character in the set from the previous step,
           duplicate the temporary bundle that resulted from the
           previous candidate character, and add the new character to
           the end of each partial label.
   4.  The temporary bundle now contains zero or more labels that
       consist of Unicode characters. For every label in the temporary
       bundle, do the following:

       Process the label with ToASCII to see if ToASCII succeeds. If it
       does, add the label to the registration bundle. Otherwise, do not
       process this label from the temporary bundle any further; it will
       not go into the registration bundle.

   The result of the processing outlined above is the registration
   bundle with the base registration and possibly other labels.
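
   The procedure above can be sketched in Python as follows.  This is
   an illustration under stated assumptions rather than a definitive
   implementation: the table is assumed to be a dictionary mapping each
   base character to a list of variant strings, "taken" is assumed to
   be the set of labels already held in other bundles, and ToASCII is
   taken from the IDNA implementation in the Python standard library
   (encodings.idna).  The names create_bundle and idna_ok are
   hypothetical.

```python
from encodings.idna import ToASCII   # standard-library RFC 3490 ToASCII
from itertools import product


def idna_ok(label):
    """True if ToASCII succeeds for the label."""
    try:
        ToASCII(label)
        return True
    except UnicodeError:
        return False


def create_bundle(base, table, taken=frozenset(), to_ascii_ok=idna_ok):
    """Return the bundle as a list (base first), or None on failure."""
    # Preliminary check: the base must not already exist in any bundle.
    if base in taken:
        return None
    # Step 1: every candidate character must be a base character.
    if any(ch not in table for ch in base):
        return None
    # Step 2: ToASCII must succeed for the base registration.
    if not to_ascii_ok(base):
        return None
    bundle = [base]
    # Step 3: build every combination of each character and its variants.
    choices = [[ch, *table[ch]] for ch in base]
    temporary = ("".join(parts) for parts in product(*choices))
    # Step 4: keep only the labels for which ToASCII succeeds.  Labels
    # already held by another bundle are skipped (an assumption drawn
    # from the registration bundle rules, not from the steps above).
    for label in temporary:
        if label in bundle or label in taken:
            continue
        if to_ascii_ok(label):
            bundle.append(label)
    return bundle
```

   The function is applied once, to the proposed base registration
   only, and never to its own output.  The number of generated labels
   grows multiplicatively with the number of characters that have
   variants, so a production system would probably need a limit.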

7.  Security Considerations

   Registration of labels in the DNS that contain essentially
   unrestricted sequences of arbitrary Unicode characters may introduce
   several opportunities for either attacks or simple confusion.  Some
   of these risks, such as confusion about which character (of several
   that look alike) is actually intended, may be associated with the
   presentation form of DNS names.  Others may be linked to databases
   associated with the DNS, e.g., with the difficulty of finding an
   entry in a "Whois file" when it is not clear how to enter, or search
   for, the characters that make up a name.  This document discusses a
   family of restrictions on the names that can be registered.
   Restrictions of the type described can be imposed by a DNS zone
   ("registry").  The document also describes some possible tools for
   implementing such restrictions.

   While the increased number and types of characters made available by
   Unicode considerably increase the scale of the potential problems,
   the problems addressed by this document are not new. No plausible set
   of restrictions will eliminate all problems and sources of confusion:
   for example, it has often been pointed out that, even in ASCII, the
   characters digit-one ("1") and lower case L ("l") can easily be
   confused in some display fonts.   But, to the degree to which
   security may be aided by sensible risk reduction, these techniques
   may be helpful.

8.  Acknowledgements

   Discussions in the process of developing the JET Guidelines were
   vital in developing this document and all of the JET participants are
   consequently acknowledged.  Attempts to explain some of the issues
   there to, and feedback from, Vint Cerf, Wendy Rickard, and members of
   the ICANN IDN Committee were also helpful in the thinking leading up
   to this document.

   An effort by Paul Hoffman to create a generic specification for
   registration restrictions of this type helped to inspire this
   document, which takes a somewhat different, more language-oriented,
   approach than his initial draft.  While the initial version of that
   draft indicated that multiple languages (or multiple language tables)
   for a single zone were infeasible, more recent versions
   [I-D.hoffman-registration] shifted to inclusion of language-based
   approaches.  The current version of this document incorporates
   considerable text, and even more ideas, from those drafts, with Paul
   Hoffman's generous permission.

   The opinions expressed here are, of course, the sole responsibility
   of the author. Some of those whose ideas are reflected in this
   document may disagree with the conclusions the author has drawn from
   them.

9.  References

   [RFC3490]  Faltstrom, P., Hoffman, P. and A. Costello,
              "Internationalizing Domain Names in Applications (IDNA)",
              RFC 3490, March 2003.

   [RFC3491]  Hoffman, P. and M. Blanchet, "Nameprep: A Stringprep
              Profile for Internationalized Domain Names (IDN)", RFC
              3491, March 2003.

   [RFC3492]  Costello, A., "Punycode: A Bootstring encoding of Unicode
              for Internationalized Domain Names in Applications
              (IDNA)", RFC 3492, March 2003.

   [RFC1035]  Mockapetris, P., "Domain names - implementation and
              specification", RFC 1035, STD 13, November 1987.

   [RFC3743]  Konishi, K., Huang, K., Qian, H. and Y. Ko, "Joint
              Engineering Team (JET) Guidelines for Internationalized
              Domain Names (IDN) Registration and Administration for
              Chinese, Japanese, and Korean", RFC 3743, April 2004.

              Internet Engineering Steering Group, IETF, "IESG Statement
              on IDN", IESG Statement IDNstatement.txt, February 2003.

              Internet Corporation for Assigned Names and Numbers,
              "Guidelines for the Implementation of Internationalized
              Domain Names, Version 1.0", June 2003.

   [IANA-language-registry]
              Internet Assigned Numbers Authority, "IDN Language Table
              Registry", April 2004.

   [RFC0952]  Harrenstien, K., Stahl, M. and E. Feinler, "DoD Internet
              host table specification", RFC 952, October 1985.

   [RFC3066]  Alvestrand, H., "Tags for the Identification of
              Languages", BCP 47, RFC 3066, January 2001.

   [Unicode]  The Unicode Consortium, "The Unicode Standard -- Version
              3.0", January 2000.

              The Unicode Consortium, "Unicode Standard Annex #28",
              March 2002.

   [Drucker]  Drucker, J., "The Alphabetic Labyrinth: The Letters in
              History and Imagination", 1995.

   [RFC3536]  Hoffman, P., "Terminology Used in Internationalization in
              the IETF", RFC 3536, May 2003.

   [I-D.hoffman-registration]
              Hoffman, P., "A Method for Registering Internationalized
              Domain Names", draft-hoffman-idn-reg-02.txt (work in
              progress), October 2003.

Author's Address

   John C Klensin
   1770 Massachusetts Ave, #322
   Cambridge, MA  02140

   Phone: +1 617 491 5735
   EMail: john-ietf@jck.com

Intellectual Property Statement

   The IETF takes no position regarding the validity or scope of any
   intellectual property or other rights that might be claimed to
   pertain to the implementation or use of the technology described in
   this document or the extent to which any license under such rights
   might or might not be available; neither does it represent that it
   has made any effort to identify any such rights. Information on the
   IETF's procedures with respect to rights in standards-track and
   standards-related documentation can be found in BCP-11. Copies of
   claims of rights made available for publication and any assurances of
   licenses to be made available, or the result of an attempt made to
   obtain a general license or permission for the use of such
   proprietary rights by implementors or users of this specification can
   be obtained from the IETF Secretariat.

   The IETF invites any interested party to bring to its attention any
   copyrights, patents or patent applications, or other proprietary
   rights which may cover technology that may be required to practice
   this standard. Please address the information to the IETF Executive
   Director.

Full Copyright Statement

   Copyright (C) The Internet Society (2004). All Rights Reserved.

   This document and translations of it may be copied and furnished to
   others, and derivative works that comment on or otherwise explain it
   or assist in its implementation may be prepared, copied, published
   and distributed, in whole or in part, without restriction of any
   kind, provided that the above copyright notice and this paragraph are
   included on all such copies and derivative works. However, this
   document itself may not be modified in any way, such as by removing
   the copyright notice or references to the Internet Society or other
   Internet organizations, except as needed for the purpose of
   developing Internet standards in which case the procedures for
   copyrights defined in the Internet Standards process must be
   followed, or as required to translate it into languages other than
   English.

   The limited permissions granted above are perpetual and will not be
   revoked by the Internet Society or its successors or assignees.

   This document and the information contained herein is provided on an

   Funding for the RFC Editor function is currently provided by the
   Internet Society.
