Network Working Group                                        J. Klensin
Internet-Draft                                            February 2003
Expires August 2003

               User Interface Evaluation and Filtering of
                    Internet Addresses and Locators

                       draft-klensin-name-filters-00.txt

Status of this Memo

     This document is an Internet-Draft and is subject to all provisions
     of Section 10 of RFC 2026.

     Internet-Drafts are working documents of the Internet Engineering
     Task Force (IETF), its areas, and its working groups.  Note that
     other groups may also distribute working documents as
     Internet-Drafts.

     Internet-Drafts are draft documents valid for a maximum of six
     months and may be updated, replaced, or obsoleted by other
     documents at any time.  It is inappropriate to use Internet-
     Drafts as reference material or to cite them other than as
     "work in progress."

     The list of current Internet-Drafts can be accessed at
     http://www.ietf.org/1id-abstracts.html

     The list of Internet-Draft Shadow Directories can be accessed at
     http://www.ietf.org/shadow.html

Abstract

        Many Internet applications have been designed to deduce top-level
        domains (or other domain name labels) from partial information.
        Whether or not this practice is desirable from an overall network
        standpoint, the designers of the applications believe that it leads
        to a better and more responsive user experience.  The introduction
        of new top-level domains, especially non-country-code ones, has
        exposed flaws in some of the methods used by these applications.
        These flaws make it more difficult, or impossible, for users of the
        applications to access the full Internet.  This memo discusses the
        techniques used and gives some guidance for minimizing their
        negative impact as the domain name environment evolves.

1. Introduction

Designers of user interfaces to Internet applications have often found
it useful to examine user-provided values for validity before passing
them to the Internet tools themselves.  This type of test, most commonly
involving syntax checks or application of other rules to domain names,
email addresses, or "web addresses" (URLs or, occasionally, extended URI
forms) may enable better-quality diagnostics for the user than might be
available from the protocol itself.  Such tests are also thought to
improve the efficiency of back-office processing programs and to reduce
load on the protocols themselves.  Certainly they are consistent with
the well-established principle that it is better to detect errors as
early as possible.

The tests must, however, be made correctly or at least safely.  If
criteria are applied that do not match the protocols, users will be
inconvenienced, addresses and sites will effectively become inaccessible
to some groups, and business and communications opportunities will be
lost.  Experience in recent years indicates that syntax tests are often
performed incorrectly, perhaps by assuming that the syntax rules are the
same for email addresses and URLs, and that tests for top-level domain
names are applied using obsolete lists and conventions.  We assume that
most of these incorrect tests are the result of inability to
conveniently locate exact definitions for the criteria to be applied.
This document draws summaries of the applicable rules together in one
place and supplies references to the actual standards.  It does not add
anything to those standards; it merely draws the information together
into a form that may be more accessible.

Many experts on Internet protocols believe that tests and rules of
these sorts should be avoided in applications and that the tests in the
protocols and back-office systems should be relied on instead.
Certainly implementations of the protocols cannot assume that the data
passed to them will be valid.  Unless the standards specify particular
behavior, this document takes no position on whether or not the testing
is desirable.  It only identifies the correct tests to be made if tests
are to be applied.

The sections that follow discuss domain names, email addresses, and
URLs.

2. Restrictions on domain (DNS) names

The authoritative definitions of the format and syntax of domain names
appear in RFCs 1035, 1123, and 2181 ([RFC1035], [RFC1123], [RFC2181]).

Any character, or combination of bits, is permitted in DNS names.
However, there is a preferred form that is required by most
applications.  That form has been the only form permitted in TLD names
and in most second-level names registered in TLDs.  It is known as the
"LDH rule" (letters, digits, hyphen), after the characters that it
permits.  The LDH rule, as updated, provides that the labels (words or
strings separated by periods) that make up a domain name must consist
only of the ASCII [ASCII] alphabetic and numeric characters, plus the
hyphen.  No other symbols or punctuation characters are permitted, nor
is blank space.  If the hyphen is used, it is not permitted to appear at
either the beginning or end of a label.  There is an additional rule
that essentially requires that top-level domain names not be
all-numeric.
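
As an illustration only, the character rule for a single label can be
written as a short check.  The sketch below uses Python; the function
name is_ldh_label is hypothetical, and the check covers only the
per-label character rule stated above, not the restriction on
all-numeric top-level names.

```python
import re

# One or more letters, digits, or hyphens, where the first and last
# characters must be a letter or digit (no leading or trailing hyphen).
_LDH_LABEL = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?")


def is_ldh_label(label: str) -> bool:
    """Return True if a single label satisfies the LDH character rule."""
    return _LDH_LABEL.fullmatch(label) is not None


if __name__ == "__main__":
    for label in ("example", "a-b", "-start", "end-", "under_score", ""):
        print(repr(label), is_ldh_label(label))
```

The pattern is anchored with fullmatch so that a label containing any
character outside the LDH set is rejected as a whole.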

Internet protocols are designed to work well only when given
"fully-qualified" domain names, i.e., ones that include all of the
labels leading to the root, including the TLD name.  Consequently,
purported DNS names to be used in applications and to locate resources
generally must contain at least one period (".") character.  Those that
do not are either invalid or require the application to supply
additional information.  Of course, this principle does not apply when
the purpose of the application is to process or query TLD names
themselves.

There is a long history of applications moving beyond the "one or more
periods" test to trying to verify that a valid TLD name is actually
present.  They have done this either by applying some heuristics to the
form of the name or by consulting a local list of valid names.  The
heuristics are no longer effective.  If one is to keep a local list,
much more effort must be devoted to keeping it up-to-date than was the
case several years ago.

The heuristics were based on the observation that, since the DNS was
first deployed, all top-level domain names were two, three, or four
characters in length.  All two-character names were associated with
"country code" domains, with the specific labels (with a few early
exceptions) drawn from the ISO list of codes for countries and similar
entities [IS3166].  The three-letter names were "generic" TLDs, whose
function was not country-specific.  And there was exactly one
four-letter TLD, the infrastructure domain "ARPA." [RFC1591].  These
length-dependent rules were, however, conventions, rather than anything
on which the protocols depended.

Before the mid-1990s, lists of valid top-level domain names changed
infrequently.  New country codes were gradually, and then more rapidly,
added as the Internet expanded, but the list of generic domains did not
change at all between the establishment of the "INT." domain and ICANN's
allocation of new generic TLDs in 2000.  Some application developers
responded by assuming that any two-letter domain name could be valid as
a TLD, but that the list of generic TLDs was fixed and could be kept
locally and tested.  Several of these assumptions changed as ICANN
started to allocate new top-level domains: one two-letter domain that
does not appear in the ISO 3166 table was tentatively approved, and new
domains were created with three-, four-, and even six-letter codes.
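
To make the failure concrete, the sketch below reconstructs the kind of
test such applications might have applied.  It is a hypothetical
reconstruction in Python, not code taken from any particular
application; the name legacy_tld_check and the exact combination of
rules are assumptions for illustration.

```python
# The generic TLDs that existed before the 2000 allocations.
LEGACY_GENERIC_TLDS = {"com", "edu", "gov", "int", "mil", "net", "org"}


def legacy_tld_check(tld: str) -> bool:
    """Obsolete heuristic: any two-letter name, a fixed generic list,
    or the infrastructure domain "arpa"."""
    tld = tld.lower()
    if len(tld) == 2 and tld.isascii() and tld.isalpha():
        return True  # assumed to be a country code
    return tld in LEGACY_GENERIC_TLDS or tld == "arpa"


if __name__ == "__main__":
    for tld in ("uk", "com", "arpa", "biz", "info", "museum"):
        print(tld, legacy_tld_check(tld))
```

Names such as "biz", "info", and "museum" fail this test even though
they are valid top-level domains, which is precisely the problem
described above.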

As of the first quarter of 2003, the list of valid, non-country,
top-level domains was .aero, .biz, .com, .coop, .edu, .gov, .info, .int,
.mil, .museum, .name, .net, .org, .pro, and .arpa.  ICANN is expected to
expand that list at regular intervals, so the list that appears here
should not be used in testing.  Instead, systems that filter by testing
top-level domain names should regularly check the current list of TLDs
(both "generic" and country-code-related) published by IANA at
http://www.iana.org/domain-names.htm.  It is likely that the better
strategy has now become to make the "at least one period" test, to
verify LDH conformance (including verification that the apparent TLD
name is not all-numeric), and then to use the DNS to determine domain
name validity, rather than trying to maintain a local list of valid TLD
names.
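
The strategy suggested above might be sketched as follows.  This is a
minimal Python illustration under stated assumptions, not a definitive
implementation: the function names are hypothetical, a single trailing
period is tolerated, and the DNS step uses the standard library call
socket.getaddrinfo, which only reports whether address records can be
found.  A domain that exists but has no address records (for example,
one used only for mail) would need a different query.

```python
import re
import socket

_LDH_LABEL = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?")


def passes_syntax_checks(name: str) -> bool:
    """Apply the "at least one period" and LDH tests, and require that
    the apparent TLD is not all-numeric.  No TLD list is consulted."""
    if name.endswith("."):
        name = name[:-1]  # tolerate one trailing period (root)
    labels = name.split(".")
    if len(labels) < 2:
        return False  # no period: not fully qualified
    if not all(_LDH_LABEL.fullmatch(label) for label in labels):
        return False
    return not labels[-1].isdigit()


def resolves_in_dns(name: str) -> bool:
    """Ask the resolver whether the name has address records."""
    try:
        socket.getaddrinfo(name, None)
    except (socket.gaierror, UnicodeError):
        return False
    return True


def domain_looks_usable(name: str) -> bool:
    # Cheap local checks first; the DNS is the final authority.
    return passes_syntax_checks(name) and resolves_in_dns(name)
```

Because no local list of TLD names is involved, a check of this shape
does not need to be updated when new top-level domains are allocated.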


3. Restrictions on email addresses

Reference documents: RFC 2821, RFC 2822

Contemporary email addresses consist of a "local part" separated from a
"domain part" by an at-sign "@".  The syntax of the domain part
corresponds to that in the previous section, and the same concerns about
filtering and lists of names apply.  The domain name can also be
replaced by an IP address in square brackets, but that form is strongly
discouraged except for testing and troubleshooting purposes.

The local part may appear using the quoting conventions described below.
The quoted forms are rarely used in practice, but are required for some
legitimate purposes and should not be rejected.  Subject to the quoting
constraints, any ASCII character, including control characters, may
appear quoted, or in a quoted string.  The backslash character is used
to quote the following character.  For example

     Abc\@def@example.com

is a valid form of an email address.  Blank spaces may also appear, as
in

     Fred\ Bloggs@example.com

The backslash character may be used to quote itself, e.g.,

     Joe.\\Blow@example.com

Conventional double-quote characters may be used to surround strings.
For example

     "Abc@def"@example.com
     "Fred Bloggs"@example.com

are alternate forms of the first two examples above.  The quoted forms
are rarely recommended, and are uncommon in practice, except insofar as
they are needed for transitions from other systems and contexts, but
those transitional requirements still arise.
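
One practical consequence of these forms is that a filter cannot
assume an address contains exactly one "@".  A minimal Python sketch,
assuming only that the domain part itself never contains "@", is to
split at the last occurrence; the function name split_address is
hypothetical, and the sketch does not validate either part.

```python
from typing import Optional, Tuple


def split_address(address: str) -> Optional[Tuple[str, str]]:
    """Split an address into (local part, domain part) at the last "@".

    Returns None if there is no "@" or if either side is empty.
    """
    local, sep, domain = address.rpartition("@")
    if not sep or not local or not domain:
        return None
    return local, domain


if __name__ == "__main__":
    print(split_address('"Abc@def"@example.com'))
    print(split_address("Abc\\@def@example.com"))
    print(split_address("no-at-sign"))
```

Splitting at the first "@", or rejecting any string with more than one,
would mishandle the quoted examples shown above.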

Without quotes, local-parts may consist of any combination of alphabetic
characters, digits, or any of the special characters

     ! # $ % & ' * + - / = ? ^ _ ` { | } ~

Period (".") may also appear, but may not be used to start or end the
local part, nor may two or more consecutive periods appear.  Forms such
as

     user+mailbox@example.com
     customer/department=shipping@example.com
     $A12345@example.com
     !def!xyz%abc@example.com
     _somename@example.com

are valid and are seen fairly regularly, but any of the characters
listed above are permitted.  In the context of local parts, apostrophe
("'") and grave accent ("`") are ordinary characters, not quoting
characters.  Some of these characters are used in conventions about
routing or other types of special handling by some receiving hosts.
But, since there is no way to know whether the remote host is using
those conventions or just treating these characters as normal text,
sending programs (and programs evaluating address validity) must simply
accept the strings and pass them on.
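
The rules for unquoted local parts can be sketched as a single pattern.
The Python below is an illustration only; the name
is_unquoted_local_part is hypothetical, and the check applies solely to
the unquoted form.  A local part that fails it may still be valid in
one of the quoted forms described earlier, so failure here is not by
itself grounds for rejecting an address.

```python
import re

# Letters, digits, and the special characters listed above.  The hyphen
# is placed last so that it is taken literally inside the class.
_ATEXT = r"A-Za-z0-9!#$%&'*+/=?^_`{|}~-"

# One or more runs of permitted characters separated by single periods:
# no leading period, no trailing period, no consecutive periods.
_UNQUOTED_LOCAL = re.compile(
    "[" + _ATEXT + "]+(?:\\.[" + _ATEXT + "]+)*"
)


def is_unquoted_local_part(local: str) -> bool:
    """Return True if local is acceptable without any quoting."""
    return _UNQUOTED_LOCAL.fullmatch(local) is not None


if __name__ == "__main__":
    for local in ("user+mailbox", "!def!xyz%abc", ".user", "a..b"):
        print(repr(local), is_unquoted_local_part(local))
```

Note that the check makes no attempt to interpret characters such as
"+" or "%"; consistent with the text above, it only accepts the string
so that it can be passed on.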


4. URLs

4.1 URL syntax definitions and issues

The syntax for URLs (Uniform Resource Locators) is specified in RFC 1738
[RFC1738].  The syntax for the more general "URI" (Uniform Resource
Identifier) is specified in RFC 2396 [RFC2396].  Programs that require
syntax checks should use the general syntax rules of RFC 2396, which are
the rules summarized below.

<<to be supplied>>

4.2 Guessing domain names in web contexts

Several web browsers have adopted a practice that permits an incomplete
domain name to be used as input instead of a complete URL.  This has,
for example, permitted users to type "microsoft" and have the browser
interpret the input as "http://www.microsoft.com/".  Other browser
versions have gone even further, trying to build DNS names up through a
series of heuristics, testing each variation in turn to see if it
appears in the DNS, and accepting the first one found as the intended
domain name.  If this approach is to be used, it is often critical that
the browser recognize the complete list of TLDs.  If an incomplete list
is used, complete domain names may not be recognized as such and the
system may try to turn them into completely different names.  For
example, "example.aero" is a fully-qualified name, since "aero." is a
TLD name.  But, if the system doesn't recognize "aero." as a TLD name,
it is likely to try to look up "example.aero.com" and
"www.example.aero.com" (and then fail or find the wrong host), rather
than simply looking up the user-supplied name.
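
The failure can be reproduced with a toy version of such a heuristic.
The Python sketch below is hypothetical: the function candidate_names
and its two completion rules are assumptions chosen to mirror the
example above, not the behavior of any specific browser.

```python
from typing import List, Set


def candidate_names(user_input: str, known_tlds: Set[str]) -> List[str]:
    """Return the names a guessing heuristic would try, in order."""
    name = user_input.strip().rstrip(".").lower()
    last_label = name.rsplit(".", 1)[-1]
    if "." in name and last_label in known_tlds:
        return [name]  # recognized as fully qualified
    # Otherwise assume the name is incomplete and start guessing.
    return [name + ".com", "www." + name + ".com"]


if __name__ == "__main__":
    stale_list = {"com", "net", "org"}
    current_list = stale_list | {"aero"}
    print(candidate_names("example.aero", stale_list))
    print(candidate_names("example.aero", current_list))
```

With the stale list, the name the user actually typed is never looked
up at all; with the current list, it is the only name tried.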

As discussed in section 2 above, there are dangers associated with
software that attempts to "know" the list of top-level domain names
locally and take advantage of that knowledge.  These name-guessing
heuristics are another example of that situation: if the lists are
up-to-date and used carefully, the systems in which they are embedded
may provide an easier, and more attractive, experience for at least some
users.  But finding the wrong host, or being unable to find a host even
when its name is precisely known, constitutes a bad experience by any
measure.

5. Implications of internationalization

6. Summary

<<to be supplied>>

7. Security considerations

Since this document merely summarizes the requirements of existing
standards, it does not introduce any new security issues.  However, many
of the techniques that motivate the document raise important security
issues of their own.  Rejecting valid forms of domain names, email
addresses, or URIs often denies service to the user of those entities.
Worse, guessing at the user's intent when an incomplete address, or other
string, is given can result in compromises to privacy or accuracy of
reference if the wrong target is found and returned.  From a security
standpoint, the optimum behavior is probably to never guess, but,
instead, to force the user to specify exactly what is wanted.  When that
position involves a tradeoff with an acceptable user experience, good
judgment should be used and the fact that it is a tradeoff recognized.

8. References

8.1 Normative References

[ASCII] American National Standards Institute (formerly United States of
America Standards Institute), X3.4, 1968, "USA Code for Information
Interchange". ANSI X3.4-1968 has been replaced by newer versions with
slight modifications, but the 1968 version remains definitive for the
Internet.

[RFC1035] Mockapetris, P.V., "Domain names - implementation and
specification", RFC 1035 and STD 13, November 1987.

[RFC1123] Braden, R., Ed., "Requirements for Internet Hosts -
Application and Support", RFC 1123 and STD 3, October 1989.

[RFC1738] Berners-Lee, T., L. Masinter, and M.  McCahill, "Uniform
Resource Locators (URL)", RFC 1738, December 1994.

[RFC2181] Elz, R. and R. Bush, "Clarifications to the DNS
Specification", RFC 2181, July 1997.

[RFC2396] Berners-Lee, T., R. Fielding, and L. Masinter, "Uniform
Resource Identifiers (URI): Generic Syntax", RFC 2396, August 1998.

8.2 Non-normative References

[IS3166]

[RFC1591] Postel, J., "Domain Name System Structure and Delegation",
RFC 1591, March 1994.

9. Acknowledgements

<<to be supplied>>

10. Author's Address

John C Klensin
1770 Massachusetts Ave, #322
Cambridge, MA 02140  USA
john-ietf@jck.com
