Network Working Group                                         S. Leonard
Internet-Draft                                             Penango, Inc.
Intended status: Informational                             J. Hildebrand
Expires: May 9, 2019                                       Cisco Systems
                                                               T. Hansen
                                                       AT&T Laboratories
                                                       November 05, 2018


                 Regular Expressions for Internet Mail
                     draft-seantek-mail-regexen-03

Abstract

   Internet Mail identifiers are used ubiquitously throughout computing
   systems as building blocks of online identity.  Unfortunately,
   incomplete understandings of the syntaxes of these identifiers have
   led to interoperability problems and poor user experiences.  Many
   users use specific characters in their addresses that are not
   properly accepted on various systems.  This document prescribes
   normative regular expression (regex) patterns for all Internet-
   connected systems to use when validating or parsing Internet Mail
   identifiers, with special attention to regular expressions that work
   with popular languages and platforms.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on May 9, 2019.

Copyright Notice

   Copyright (c) 2018 IETF Trust and the persons identified as the
   document authors.  All rights reserved.





Leonard, et al.            Expires May 9, 2019                  [Page 1]


Internet-Draft                mail-regexen                 November 2018


   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.  Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Introduction  . . . . . . . . . . . . . . . . . . . . . . . .   3
     1.1.  Normative Effects . . . . . . . . . . . . . . . . . . . .   4
     1.2.  Definitions . . . . . . . . . . . . . . . . . . . . . . .   4
   2.  History and Formal Models for Internet Mail Identifiers . . .   6
     2.1.  The Core History  . . . . . . . . . . . . . . . . . . . .   6
     2.2.  Multipurpose Internet Mail Extensions and Uniform
           Resource Identifiers  . . . . . . . . . . . . . . . . . .   9
     2.3.  Email Address Internationalization  . . . . . . . . . . .   9
     2.4.  The Data Model  . . . . . . . . . . . . . . . . . . . . .  10
       2.4.1.  Email Address . . . . . . . . . . . . . . . . . . . .  10
       2.4.2.  Message-ID  . . . . . . . . . . . . . . . . . . . . .  11
     2.5.  Equivalence and Comparison  . . . . . . . . . . . . . . .  11
       2.5.1.  Email Address . . . . . . . . . . . . . . . . . . . .  11
       2.5.2.  Message-ID  . . . . . . . . . . . . . . . . . . . . .  13
   3.  Regular Expressions for Email Addresses . . . . . . . . . . .  13
     3.1.  Deliverable Email Address . . . . . . . . . . . . . . . .  14
       3.1.1.  ASCII Building Blocks . . . . . . . . . . . . . . . .  14
       3.1.2.  Deliverable Email Address . . . . . . . . . . . . . .  14
       3.1.3.  (Leftover from draft-00) Basic Rules of Derivation
               (Unicode) . . . . . . . . . . . . . . . . . . . . . .  16
       3.1.4.  Complete Expression for Deliverable Email Address . .  17
       3.1.5.  Using Character Classes . . . . . . . . . . . . . . .  18
       3.1.6.  "Flotsam" and "Jetsam" Beyond ASCII . . . . . . . . .  18
       3.1.7.  Certain Expressions for Restrictions  . . . . . . . .  18
       3.1.8.  Unquoting Local-Part  . . . . . . . . . . . . . . . .  20
       3.1.9.  Quoting Local-Part  . . . . . . . . . . . . . . . . .  20
     3.2.  Modern Email Address  . . . . . . . . . . . . . . . . . .  20
     3.3.  Legacy Email Address  . . . . . . . . . . . . . . . . . .  21
     3.4.  Algorithms for Detecting Email Addresses  . . . . . . . .  21
     3.5.  Handling Domain Names . . . . . . . . . . . . . . . . . .  22
   4.  Regular Expressions for Message-IDs . . . . . . . . . . . . .  22
     4.1.  Modern Message-ID . . . . . . . . . . . . . . . . . . . .  22
     4.2.  General Message-ID  . . . . . . . . . . . . . . . . . . .  23
   5.  Security Considerations . . . . . . . . . . . . . . . . . . .  23
   6.  References  . . . . . . . . . . . . . . . . . . . . . . . . .  24
     6.1.  Normative References  . . . . . . . . . . . . . . . . . .  24
     6.2.  Informative References  . . . . . . . . . . . . . . . . .  26
   Appendix A.  Test Vectors . . . . . . . . . . . . . . . . . . . .  28
   Appendix B.  Change Log . . . . . . . . . . . . . . . . . . . . .  28
   Authors' Addresses  . . . . . . . . . . . . . . . . . . . . . . .  28

1.  Introduction

   Internet Mail is everywhere.  This fact of modern connected life is
   so self-evident that [RFC5598] says: "In practical terms, an email
   address string has become the common identifier for representing
   online identity."  [MTECHDEV] acknowledges that email "has been one
   of the major technical and sociological developments of the past 40
   years."  Whether it is joining a social network, participating in a
   forum, blogging, paying taxes, buying products, conducting
   professional correspondence, or communicating with loved ones, one's
   email address forms the cornerstone or backstop (frequently both) to
   these methods of communication.  Internet mail is not only
   ubiquitous: it is essentially free for all users connected to the
   Internet.

   Yet it is surprising how fragile or cavalier many systems are with
   their treatment of Internet Mail identifiers, namely with email
   addresses.  Prominent government agencies, financial institutions,
   and even major mail services reject a variety of forms that are in
   wide use by many user communities.  [[NB: do a survey.]] For example,
   in the intervening time between IETF 94 and the submission of this
   Internet-Draft, the author interacted with no fewer than 25 different
   web or other Internet-connected services that rejected or mangled his
   perfectly valid email addresses.  The result is a pernicious and
   creeping degradation of mail service and of the usability of the
   Internet Mail infrastructure, resulting in undelivered mail,
   misdelivered mail (which can constitute a security vulnerability),
   and denial of service.

   The Internet Mail standards, like the mail system, have evolved over
   time and have been modified to accommodate volumes and scenarios far
   beyond their original design goals.  Furthermore, some identifier
   forms have been restricted over time as certain syntaxes were
   determined to be harmful, arcane, or just plain useless.  [[So while
   not "blame", some responsibility or causation lies with these
   standards, which go out of their way to balance backwards-
   compatibility, complexity, completeness, and flexibility at the
   expense of a simple and widely-implementable addressing format.]]

   This document prescribes normative regular expression (regex)
   patterns for all Internet-connected systems to use when validating or
   parsing Internet Mail identifiers.  Attention specifically focuses on
   "email address" (the specification for the string commonly associated
   with a single mailbox at a single named entity), and Message-ID,
   which share nearly identical syntax, but have different use cases and
   semantics.  First, the history of Internet Mail is traced to build a
   coherent data model for Internet Mail identifiers.  Second, relevant
   expression formats are discussed.  Third, expressions to fit the
   identifiers in a variety of computing contexts are developed and
   presented.  The overall goal of this document is to establish cut-
   and-dried algorithms that developers can incorporate directly into
   their mail-using products (including web browsers, form validators,
   and software libraries), replacing current ad-hoc (and oftentimes
   atrociously inconsistent) approaches with standardized behavior.

1.1.  Normative Effects

   This document is proposed with either Informational or Best Current
   Practice status [RFC1818] for all users and systems of Internet Mail
   identifiers, which basically means everyone connected to the
   Internet, other than the mail infrastructure itself.  The Internet
   Mail infrastructure has been standardized (and continues to be
   standardized by) other documents, most notably [RFC5321] and
   [RFC5322].  Therefore, implementers developing mail systems MUST rely
   on those standards when building interoperable mail systems.  At the
   same time, the text of this specification has been [[NB: will be]]
   carefully vetted by [[the IETF]] so that implementers SHALL be able
   to rely on it as a normative reference.  Whether designing a new
   standard or implementing a new system that uses Internet Mail
   identifiers for some other purpose (e.g., as usernames, security
   principals, or keys in a database), relying parties can "copy-and-
   paste" the expressions in this document to parse, validate, compose,
   and process Internet Mail identifiers, rather than relying on
   homegrown solutions.

   Internet Mail has evolved over forty years, and will undoubtedly
   continue to evolve over time.  This document does not constrain that
   development process.  Actually, it is expected that expressions in
   this document will be updated to match changes in Internet Mail.

1.2.  Definitions

   The terms "email address" and "address" (without qualification) refer
   to the string commonly associated with a single mailbox at a single
   named entity.  In this document, the prose text always qualifies
   these terms with the source document when using a different sense.

   The term "Message-ID" (without qualification) refers to the globally
   unique string, composed of a left part, the at-sign (@), and a right
   part, that is used to identify a single message.  In this document,
   the prose-text always qualifies this term when using a different
   sense.  The term "Message-ID field", for example, refers to the
   Message-ID Header Field, which includes the characters "Message-ID:"
   and the surrounding "<" and ">" angle brackets.

   Unquoted all-capital symbols in prose text have the meanings
   specified in [I-D.seantek-abnf-more-core-rules].

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
   "OPTIONAL" in this document are to be interpreted as described in
   [RFC2119].

   This document provides expressions that can be used directly in
   popular computing platforms.  An important subset of email address
   syntaxes, namely deliverable email addresses, can be described in a
   regular language [[CITE]].  A regular language is a language
   recognized by (and computable with) a finite automaton.  [[CITE:
   Kleene's Theorem]] However, the full syntax of email addresses
   requires a context-free language (i.e., governed by ABNF) and a
   pushdown automaton.

   The term "regular expression" refers to a sequence of characters that
   defines a search pattern for string matching.  Originally the term
   referred to expressions that described regular languages in formal
   language theory.  Formal regular expressions are limited to:

   o  empty set and empty string

   o  literal character

   o  concatenation

   o  alternation

   o  repetition (Kleene star)

   Modern-day libraries support expressions that far exceed the regular
   languages.  Some libraries even support capabilities that exceed the
   context-free languages.  However, this document limits itself to
   truly regular grammars where possible, and where not possible, to
   context-free grammars.  Implementers can therefore implement (or
   compile) these specifications on computing-constrained devices.
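
   As a non-normative illustration of that restriction, the following
   Python sketch builds a pattern from character classes,
   concatenation, and repetition only.  It matches the dot-atom-text
   production of [RFC5322] and is not one of the expressions defined by
   this document; the names ATEXT, DOT_ATOM, and is_dot_atom are
   assumptions made for the example.

```python
import re

# atext from RFC 5322: ASCII letters, digits, and 19 printable specials.
ATEXT = r"[A-Za-z0-9!#$%&'*+/=?^_`{|}~-]"

# dot-atom-text = 1*atext *("." 1*atext)
# "x+" is shorthand for "xx*", so only concatenation and the Kleene
# star are needed; the (?:...) group is plain grouping, nothing more.
DOT_ATOM = re.compile(ATEXT + r"+(?:\." + ATEXT + r"+)*")


def is_dot_atom(text: str) -> bool:
    """Return True if the whole string is dot-atom-text."""
    return DOT_ATOM.fullmatch(text) is not None
```

   Because the pattern uses no backreferences or lookaround, it
   describes a regular language in the formal sense and can be compiled
   to a finite automaton.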

   The regular expressions in this document are intended to conform to
   the following de-jure or de-facto standards.  Where expressions are
   given, they are annotated with single characters that refer to the
   standards to which they conform.  [[NB: or, are intended to conform
   to, after further development.]]

   +---+---------+-----------------------------------------+-----------+
   | 1 | 4       | Title                                   | Ref       |
   +---+---------+-----------------------------------------+-----------+
   | P | PCRE:   | Perl Compatible Regular Expressions 2   | [PCRE2]   |
   |   |         | (version 10.21, January 12, 2016)       |           |
   |   |         |                                         |           |
   | E | (P)ERE: | POSIX Extended Regular Expressions      | [POS-ERE] |
   |   |         | (POSIX/IEEE Std 1003.1, Issue 7, 2013   |           |
   |   |         | Ed.) [[NB: could be ERE, PERE, or       |           |
   |   |         | P-ERE]]                                 |           |
   |   |         |                                         |           |
   | J | JSRE:   | JavaScript Regular Expressions          | [JSRE6ED] |
   |   |         | (ECMAScript/ECMA-262, 6th Ed., 2015)    |           |
   +---+---------+-----------------------------------------+-----------+

   [[TODO: Need to do something with UTS #18: Unicode Regular
   Expressions.]]

   Implementers should exercise caution when using a library that
   claims to be "Perl-Compatible" without actually being the bona-fide
   PCRE library: it may exhibit different or incomplete behavior.
   Implementers should also note that ERE and JSRE are fully implemented
   as alternative grammars in the std::regex library of C++11 and its
   successor, C++14.  [[TODO: cite C++ standards.]] In the absence of a
   "live" regular expression library, the expressions in this document
   are easily compiled into automata (i.e., target language code) using
   well-studied algorithms.

   Surrounding delimiters (i.e., slashes) are omitted unless relevant to
   the proffered usage.

2.  History and Formal Models for Internet Mail Identifiers

2.1.  The Core History

   Internet Mail (also known as electronic mail, e-mail, email, or
   simply "mail") is an asynchronous, store-and-forward method of
   exchanging digital messages from an author to one or more recipients.
   [MTECHDEV] recounts the technical development of email, which is too
   voluminous to be repeated here.  This section specifically focuses on
   the history of the identifiers used in email.

   When people think of an email address, <user@example.com> is the
   quintessential example.  However, addresses did not always look this
   way.  Electronic mailing systems come from the earliest days of
   networking, before ARPANET.  The specification for identifiers was
   defined by the networking project on a more-or-less ad-hoc basis.
   When ARPANET began, mail and file transfer were seen as important
   founding services.  In fact, FTP was a significant transport
   mechanism for mail.

   By 1973, users of ARPANET came together to propose [RFC0524] and
   standardize [RFC0561] parts of the mail system.  In these early
   specifications, the at-sign "@" was not used; instead, the term "AT"
   separated the left-side (user production) from the right-side (host
   production).  Tokens (the word production, specifically) during that
   time were separated by SP, CR, and LF.  The word production was
   defined at that time to be any ASCII character other than CR, LF, and
   SP.  [RFC0524] only standardized the From header, not any recipient
   headers.

   In 1975, [RFC0680] proposed a more formal format for message
   fields beyond [RFC0561], including "receiver specification fields"
   (TO, CC, and BCC) and "reference specification fields" (MESSAGE-ID,
   IN-REPLY-TO, REFERENCES, and KEYWORDS) for the first time.  The
   receiver fields used mailbox productions with user and host parts
   separated by "@" (without spaces), in contrast to the originator
   specification fields (specifically FROM and SENDER), which continued
   to use "AT" (with single spaces on either side).  An [RFC0680]
   Message-ID was structured with a "Net Address" (presumably, a host
   name) in square brackets, followed by a line production (any
   characters other than CR or LF).

   The first real email standards that resemble modern usage were
   published in 1977: [RFC0724] and its revision, [RFC0733].  Those
   documents are historic and their formats are incompatible with most
   modern mail systems, but understanding them provides important
   insights into the structure of identifiers starting with [RFC0821]
   and [RFC0822].  The "@" character gained greater prominence, although
   "AT" (then lowercased to "at" in the specifications) was still
   supported.  Importantly, [RFC0733] Message-ID was standardized to
   match the [RFC0733] email address format, and a uniqueness guarantee
   was added: "The uniqueness of the message identifier is guaranteed by
   the host which generates it."  Essentially, this text implies that
   the right-hand part of the Message-ID is to be a hostname, or
   operatively associated with a hostname (i.e., not just a random
   string, but a unique (possibly random) string assigned by a host).
   At this point, Message-ID and email address specifications converged
   in the host-phrase production.

   [RFC0821] and [RFC0822] are the foundational RFCs for modern mail
   usage, and their revisions over the years retain the same basic
   structure and division between mail transfer (specifically, SMTP) and
   mail format.  Jon Postel's invention of SMTP grew out of experience
   and disenchantment with the Mail Transfer Protocol (MTP) [RFC0772].
   The Simple Mail Transfer Protocol is indeed simple: it is structured
   like FTP but only has a limited set of commands: HELO, MAIL FROM,
   RCPT TO, DATA, QUIT, VRFY, and EXPN (and a few others not relevant
   for this discussion).  The commands can take at most one argument.
   [RFC0821] describes a forward-path argument rather than an email
   address argument: the distinction is that after the username, a
   series of "@" host specifications designate the hops through which
   the message is supposed to travel before it reaches its destination.

   The main [RFC0821] production for an email address is the <mailbox>,
   which is defined as <local-part> "@" <domain>.  The <local-part>
   production can be a <dot-string> or a <quoted-string>.  In both
   cases, the full range of ASCII characters is actually permitted,
   although different characters must be backslash-escaped in different
   productions.  The <domain> production is extremely broad and can
   "stack" domain-element components with the period "."; a domain-
   element component can be a hostname, "#" and a number, or an IPv4
   address enclosed in square brackets "[" and "]".

   [RFC0822] introduced the "addr-spec" ABNF production, which is that
   series' term for an email address.  While a [RFC0822] route-addr
   production can include a source route (aka forward-path with multiple
   hosts), addr-spec is noted to be a global address, with the right
   side being a "domain" production.  This definition predated the first
   DNS standards [RFC0882] and [RFC0883], although it was clearly
   designed with DNS in mind.  The [RFC0822] ABNF permits a domain to
   include multiple domain-literal productions (i.e., bracketed)
   separated by "."; however, the accompanying text basically obviates
   such productions.  As [RFC0822] predated the widespread
   implementation of DNS, various systems would spread routing
   information between the local-part and domain productions (see
   Section 6.2.5).  [RFC0822] discusses local-part syntax extensively,
   including examples of comment productions that are supposed to be
   ignored semantically (see Section A.1).

   Section 3.4.7 of [RFC0822] describes the constituent components of an
   address as requiring "preservation of case information", which is
   slightly different than saying "case-sensitive" outright (although
   the latter is strongly implied).  The main historical point to glean
   is that intermediate mail systems were supposed to transit the local-
   part AS-IS without modification, so that the destination system--and
   only the destination system--would parse it.

   [RFC0822] assigns specific semantics to Message-ID but is light on
   syntax: the msg-id production is just addr-spec enclosed in mandatory
   "<" and ">".

   The widespread deployment of DNS [RFC0973] (later [RFC1034] and
   [RFC1035], with [RFC1123] relaxation) dramatically changed the
   Internet messaging landscape of the late 1980s.  [[TODO: complete.]]
   With respect to email addresses and Message-IDs, it became obvious
   that the right-hand side of the "@" was supposed to represent a
   fully-qualified domain name at which Mail eXchanger records are
   located [[TODO: cite]].  Other networks and host-naming formats
   became obsolete by the mid-1990s.

   The 2001 standards [RFC2821] and [RFC2822] reflect the explosive
   growth and diverse deployments of Internet mail.  The work was
   undertaken in the DRUMS (Detailed Revision/Update of Message
   Standards) working group between 1995 and 2001.  [RFC2822] introduced
   the "obs-" prefix in its ABNF, along with Section 4, Obsolete Syntax.
   Essentially, [RFC2822] prescribes a generation format that is much
   stricter than the parsing format, but still demands that conforming
   implementations understand the parsing format.  Therefore, the U+0000
   NULL character that is considered obsolete in [RFC2822] must still be
   considered part of the data model of email addresses.  The syntax of
   [RFC2821] is tighter than [RFC2822], and a careful read makes it
   apparent that the underlying address formats diverged in the
   intervening years.  Specifically, [RFC2822] is much more concerned
   with historical forms than [RFC2821], which is more about
   contemporary transmission behavior.  At the same time, [RFC2821] does
   not actually prohibit a wide range of C0 control characters, which
   still remain part of [RFC2821]'s data model.  [[TODO: complete
   history of RFC 282x.]]

   The most recently completed revisions to the base mail standards,
   [RFC5321] and [RFC5322], were developed between 2005 and 2008.  Email
   addresses saw further character restrictions, namely around the
   entire range of C0 control characters, which [RFC5321] explicitly
   prohibits.  [[TODO: complete history of RFC 532x.]]

2.2.  Multipurpose Internet Mail Extensions and Uniform Resource
      Identifiers

   [[Message-ID and Content-ID.  Therefore this document applies with
   equal normative force to Content-IDs, and to mid: and cid: URIs that
   use them.]]

2.3.  Email Address Internationalization

   As the Internet became a ubiquitous feature of modern life, and as
   email followed it, users in various countries called for identifiers
   that were usable in their native languages.  The IETF [[worked on
   this]] in the email address internationalization (EAI) effort,
   culminating in [RFC6530], [RFC6531], and [RFC6532].  The key changes
   for identifiers were to expand the character repertoire of email
   addresses, and anything looking like email addresses (e.g., Message-
   ID), to include UTF-8 octets.  As UTF-8 can encode any Unicode scalar
   value (but not any Unicode code point), the practical result is that
   addresses and Message-IDs can contain (almost) any Unicode scalar
   value.
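
   The distinction between scalar values and code points can be checked
   mechanically.  The Python sketch below (the function names are
   assumptions made for illustration) tests whether a code point is a
   scalar value and whether Python's strict UTF-8 encoder accepts it;
   the surrogate code points U+D800 through U+DFFF fail both tests.

```python
def is_scalar_value(code_point: int) -> bool:
    """True for U+0000..U+10FFFF, excluding surrogates U+D800..U+DFFF."""
    return 0 <= code_point <= 0x10FFFF and not 0xD800 <= code_point <= 0xDFFF


def utf8_encodable(code_point: int) -> bool:
    """True if Python's strict UTF-8 encoder accepts the code point."""
    try:
        chr(code_point).encode("utf-8")
        return True
    except UnicodeEncodeError:
        # Lone surrogates are code points but not scalar values.
        return False
```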

   A practical result of EAI, combined with MIME and MIME URIs, is that
   MIME URIs now are UTF-8 encoded in the pct-encoded production.

2.4.  The Data Model

   Parties that rely on this document SHALL interpret the semantically
   meaningful parts of Internet Mail identifiers as follows.

2.4.1.  Email Address

   An email address is comprised of a local-part and a domain, separated
   by "@".  The parsed local-part is a sequence of Unicode scalar
   values, and can be represented by any well-formed Unicode encoding.
   [[NB: The following may be CONTROVERSIAL!!]] The parsed domain is a
   sequence of restricted Unicode scalar values that represent some
   identifier for some host on the network.  The parsed domain can be:

   1.  a hostname of any kind, including (but not limited to) a DNS
       name;

   2.  an IPv4 address literal;

   3.  an IPv6 address literal;

   4.  an address literal, comprised of a standardized tag and a
       sequence of ASCII characters.

   Conveniently, the domain string subtypes can be combined into a
   single well-formed Unicode string, discriminated as follows:

   1.  If the string begins with "IPv6:", it is a type 3 IPv6 address,
       and the remainder had better be a valid IPv6 address in textual
       form.

   2.  If the string begins with Ldh-str and a colon ":", it is a type 4
       address literal, and the remainder had better be dcontent (which
       notably is not supposed to contain characters beyond ASCII).

   3.  If the string has four sets of digits 0-255 separated by dots,
       then it is an IPv4 address.

   4.  Otherwise, it had better be a domain name (i.e., comprised of NR-
       LDH labels and U-labels, separated by dots).  [[NB: [RFC1912]
       says a label can't be all-numeric, but then it catalogs some
       exceptions.]]

   5.  Finally, it is "some random Unicode string" that is syntactically
       valid under the most expansive rules, but is not useful for
       delivering or reporting on Internet mail.
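
   The discrimination steps above can be sketched in Python as follows.
   This is an approximation under stated assumptions, not a normative
   algorithm: the function name and return labels are hypothetical, the
   input is assumed to be the parsed domain without enclosing brackets,
   the "IPv6:" prefix is matched case-insensitively, the standard
   ipaddress module stands in for the IPv6 syntax of [RFC5321] (and may
   accept forms that [RFC5321] does not), the label test only
   approximates NR-LDH labels and U-labels, and a malformed remainder
   in steps 1 and 2 is reported as an error.

```python
import ipaddress
import re

TAG_PREFIX = re.compile(r"[A-Za-z0-9-]*[A-Za-z0-9]:")  # Ldh-str ":"
DCONTENT = re.compile(r"[\x21-\x5A\x5E-\x7E]+")        # RFC 5321 dcontent
IPV4 = re.compile(r"([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})")
# Assumption: a label is 1-63 letters, digits, hyphens, or non-ASCII
# characters. Real IDNA validation of U-labels is far stricter.
LABEL_CHARS = re.compile(r"[A-Za-z0-9\u0080-\U0010FFFF-]{1,63}")


def _is_label(label: str) -> bool:
    """Rough label test: allowed characters, no edge hyphens."""
    if LABEL_CHARS.fullmatch(label) is None:
        return False
    return not (label.startswith("-") or label.endswith("-"))


def classify_domain(domain: str) -> str:
    """Return "ipv6", "literal", "ipv4", "domain", or "other"."""
    # Step 1: an "IPv6:" prefix commits to an IPv6 address literal.
    if domain[:5].lower() == "ipv6:":
        ipaddress.IPv6Address(domain[5:])  # raises ValueError if malformed
        return "ipv6"
    # Step 2: Ldh-str followed by ":" commits to a tagged address literal.
    tag = TAG_PREFIX.match(domain)
    if tag:
        if DCONTENT.fullmatch(domain[tag.end():]) is None:
            raise ValueError("address literal value is not dcontent")
        return "literal"
    # Step 3: four decimal fields, each in the range 0-255.
    quad = IPV4.fullmatch(domain)
    if quad and all(int(field) <= 255 for field in quad.groups()):
        return "ipv4"
    # Step 4: dot-separated labels.
    if all(_is_label(label) for label in domain.split(".")):
        return "domain"
    # Step 5: anything else.
    return "other"
```

   The order of the tests matters: the colon checks run first because a
   colon cannot appear in a domain name, and the IPv4 test runs before
   the domain-name test because a dotted quad also looks like four
   all-numeric labels.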

2.4.2.  Message-ID

   A Message-ID is comprised of a left part (id-left) and a right part
   (id-right), separated by "@".  The id-left is a sequence of Unicode
   scalar values, and can be represented by any well-formed Unicode
   encoding.  [[NB: The following may be CONTROVERSIAL!!]] The id-right
   is a sequence of Unicode scalar values, and can be represented by any
   well-formed Unicode encoding.

   (A Content-ID [RFC2045] has the same composition as a Message-ID.)

2.5.  Equivalence and Comparison

2.5.1.  Email Address

   Two email addresses are equivalent if the parsed local-part and
   parsed domain values are equivalent.

   Two parsed local-parts are equivalent if their Unicode scalar values
   are equal.

   The special values "postmaster", "abuse", and [TODO: fill out] SHALL
   be compared case-insensitively [TODO: full-width characters?  Unicode
   case folding? etc.].

   [Case sensitivity] In all other cases, two parsed local-parts may be
   equivalent if the [TODO: receiving MTA] delivers mail addressed to
   them to the same mailbox.  There is no algorithmic comparison to
   determine said equivalence.

   Two parsed domains are equivalent if both have the same type, and the
   values are equivalent.  Additionally, an IPv6 address literal is
   equal to an IPv4 address literal when the IPv6 address is an
   "IPv4-mapped IPv6 address", and its IPv4 component equals the IPv4
   address of the IPv4 address literal.

   1.  hostnames (domain names): equal [TODO: RFC-REF, IDNA, RFC1034,
       RFC1035, etc.].

   2.  IPv4 addresses: equal [TODO: RFC-REF].

   3.  IPv6 addresses: equal [TODO: RFC-REF].

   4.  address literal: the standardized tag is equal (case-
       insensitive), and the address value is octet-for-octet equal
       (case-sensitive), or is equal per the rules standardized by the
       standardized tag registration.
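
   The IPv4-mapped rule can be sketched in Python with the standard
   ipaddress module.  The function name is an assumption, and the
   inputs are assumed to be textual addresses with any "IPv6:" tag
   already removed.

```python
import ipaddress


def ip_literals_equal(a: str, b: str) -> bool:
    """Compare two textual IP addresses, honoring IPv4-mapped IPv6."""
    x = ipaddress.ip_address(a)
    y = ipaddress.ip_address(b)
    # Reduce an IPv4-mapped IPv6 address (::ffff:a.b.c.d) to its IPv4
    # component so that it can be compared with a plain IPv4 address.
    if isinstance(x, ipaddress.IPv6Address) and x.ipv4_mapped is not None:
        x = x.ipv4_mapped
    if isinstance(y, ipaddress.IPv6Address) and y.ipv4_mapped is not None:
        y = y.ipv4_mapped
    # Addresses of different versions never compare equal.
    return x == y
```

   Parsing before comparing also makes different textual spellings of
   the same address (for example, compressed and uncompressed IPv6
   forms) compare equal.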

2.5.1.1.  Case sensitivity of local-part

   Of all equivalence issues, no issue generates more confusion and
   dissent in the email community than the case sensitivity of the
   local-part.  Formally, local-part is case-sensitive.  A significant
   fraction of installed mail servers treat local-part as case-
   insensitive in the ASCII range.  (At the time of this writing, EAI
   has not been implemented widely enough to make statements about case
   insensitivity for characters beyond ASCII.)  [[TODO: survey and
   statistics.]] Furthermore, many systems [[TODO: quantify]] outside of
   the Internet Mail infrastructure compare the local-part of email
   addresses case-insensitively.

   Historically, local-parts were case-sensitive because of MULTICS.
   However, as time went on they became case-preserving: receiving
   systems would not register additional mailbox names (i.e., local-
   parts) if the proposed mailbox name differed from an existing mailbox
   name only by case.

   [[TODO: develop recommendation about how wise or unwise it is to go
   one way or another.]] One possible approach is to define an
   additional output of the equivalence algorithm, "conditionally
   equivalent".  Therefore an implementation conforming to this document
   SHALL output one of three states: equivalent, not equivalent, and
   "conditionally equivalent", that is, equivalent if and only if local-
   part is compared case-insensitively in the ASCII range [[TODO: should
   we say in the full Unicode range?]].  Applications implementing this
   algorithm SHOULD NOT treat such a state as "equivalent".  For
   example, a user-facing application SHOULD treat this state as a
   "warning" that requires further intervention.

   In modern times, email addresses tend to be emitted [[TODO:
   statistics]] in all-lowercase, when case is normalized.  Therefore,
   applications implementing this algorithm [[TODO: document]] that are
   aware that the receiving MTA is case-insensitive, as well as
   applications implementing this algorithm that receive input that is
   case-ambiguous (such as voice input), SHOULD record the local-part in
   all-lowercase unless presented with evidence to the contrary.
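
   The three-state comparison described in this section might be
   sketched as follows.  This is a minimal illustration, not a normative
   algorithm: the names Equivalence, ascii_lower, and compare_addresses
   are hypothetical, the inputs are assumed to be already-parsed,
   unquoted "local-part@domain" strings, and the domain is assumed to be
   a domain name that is compared case-insensitively in the ASCII range.

```python
from enum import Enum


class Equivalence(Enum):
    EQUIVALENT = "equivalent"
    NOT_EQUIVALENT = "not equivalent"
    CONDITIONALLY_EQUIVALENT = "conditionally equivalent"


def ascii_lower(text: str) -> str:
    """Lowercase only A-Z; every other character is left untouched."""
    return "".join(chr(ord(c) + 32) if "A" <= c <= "Z" else c for c in text)


def compare_addresses(a: str, b: str) -> Equivalence:
    """Compare two addresses and report one of the three states."""
    local_a, _, domain_a = a.rpartition("@")
    local_b, _, domain_b = b.rpartition("@")
    # Assumption: domain names compare case-insensitively (ASCII range only).
    if ascii_lower(domain_a) != ascii_lower(domain_b):
        return Equivalence.NOT_EQUIVALENT
    if local_a == local_b:
        return Equivalence.EQUIVALENT
    # The local-parts differ only by ASCII case: flag it, do not assume a match.
    if ascii_lower(local_a) == ascii_lower(local_b):
        return Equivalence.CONDITIONALLY_EQUIVALENT
    return Equivalence.NOT_EQUIVALENT
```

   A user-facing caller would surface CONDITIONALLY_EQUIVALENT as a
   warning rather than silently treating it as a match.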

   The EAI standards (RFCs 6530-6532) make no mention of case
   sensitivity issues for characters beyond the ASCII range.  Permitting
   Unicode scalar values (i.e., UTF-8) opens up a whole range of
   comparison issues with potentially far-reaching identity and security
   implications.  [[TODO: discuss case preservation and sensitivity
   issues in characters beyond the ASCII range.  The bottom line is, use
   what you're given, don't mess with it, hope for the best.]]

2.5.2.  Message-ID

   Two Message-IDs are equivalent if the parsed id-left and parsed id-
   right values are equivalent.

   Two parsed id-left values are equivalent if their Unicode scalar
   values are equal.

   [[NB: controversial!]] Two parsed id-right values are equivalent if
   their Unicode scalar values are equal.

   [[TODO: The case sensitivity of id-right has not been fully explored
   in any standard to-date.  To the extent that id-right represents a
   domain name, there is a strong argument to treat id-right as case-
   insensitive in the ASCII range.  Standardized tags are probably case-
   insensitive based on the ABNF of [RFC5321] relating to "IPv6".  The
   rest is kind of up for grabs.  The bottom line is, if you intend to
   match the Message-ID of an existing message, don't take chances: just
   copy it verbatim into the destination.]]

   (Two Content-IDs are equivalent under the same rules.  However, a
   Content-ID and a Message-ID should never be equal to each other; if
   they are, at least one of them was generated incorrectly, because
   both Content-ID and Message-ID are supposed to be "world-unique"
   [RFC2045] [RFC5322].)

3.  Regular Expressions for Email Addresses

   [[Valid email address vs. deliverable email address]] A "deliverable
   email address" complies with the modern production rules of [RFC5321]
   and [RFC6531].  A deliverable email address SHALL have a domain part
   that is a domain name (Section 2.3.5 of [RFC5321]); it SHALL NOT have
   a domain part that is an address literal (Section 4.1.3 of [RFC5321])
   or a host name that does not comply with domain name rules (see
   [RFC1034], [RFC1035], [RFC5890] et seq.).  [[TODO: justify.
   Basically, experiments have borne out that many mail systems will
   not accept user@[3.3.3.3] as a RCPT TO.  Technically it is valid per
   RFC 5321, but in practice any receiving MTA that handles more than
   one MX/domain will have difficulty figuring out which domain-specific
   mailbox should receive the mail.  The author tried a couple of
   popular MTAs.]] Systems that use email addresses with an expectation
   of SMTP delivery SHALL accept productions that comply with this
   document.

   Email addresses that do not meet these modern production rules may
   nevertheless be valid under the other modern (e.g., [RFC5322]) or
   legacy (e.g., [RFC0821] and [RFC0822]) production rules.

   Systems that recognize "modern email addresses" in new corpora (e.g.,
   in text editors) SHALL accept productions that comply with this
   document.

   Systems that recognize "legacy email addresses" in existing corpora
   (e.g., in email messages or documents predating this document) SHALL
   accept productions that comply with this document.

3.1.  Deliverable Email Address

   A deliverable email address is an email address that can be used to
   deliver messages over the modern SMTP infrastructure.  This has
   important implications for the domain part, because the domain part
   MUST represent a domain name that complies with contemporary rules
   and regulations, such as [RFC5890].

3.1.1.  ASCII Building Blocks

   The following rules are amalgamated from the SMTP and Internet
   message format standards [RFC5321] [RFC5322].  All expressions are
   PCRE2-compatible.

   (?(DEFINE)
    (?#local)
    (?<atext>[0-9A-Za-z!#-'*+\-/=?^_`{-~])
    (?<dot_string>(?&atext)+(?:\.(?&atext)+)*)
    (?<qtext>[ !#-\[\]-~])
    (?<quoted_pair>\\[ -~])
    (?<qcontent>(?&qtext)|(?&quoted_pair))
    (?<quoted_string>"(?&qcontent)*")
    (?<local_part>(?&dot_string)|(?&quoted_string))
    (?#domain)
    (?<oct>0*(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]?))
    (?<IPv4>(?&oct)(?:\.(?&oct)){0,3})
    (?<sub_domain>[0-9A-Za-z](?:[\-0-9A-Za-z]{0,61}[0-9A-Za-z])?)
    (?<domain>(?!(?&IPv4)(?![\-.0-9A-Za-z]))(?&sub_domain)
     (?:\.(?&sub_domain))*(?![\-.0-9A-Za-z]))
   )
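
   Engines that lack (?(DEFINE)...) groups and subroutine calls can
   assemble the same building blocks by string concatenation.  The
   following sketch does so for Python's "re" module, which is an
   assumption of this example (the expressions above target PCRE2).
   The constant names and is_deliverable are hypothetical.  The sketch
   writes the IPv4 rejection so that it also applies at the very end of
   the string, and it includes the 254-character lookbehind used by the
   length-limited patterns in this document.

```python
import re

# Local part: Dot-string or Quoted-string, ASCII only.
ATEXT = r"[0-9A-Za-z!#-'*+\-/=?^_`{-~]"
DOT_STRING = ATEXT + r"+(?:\." + ATEXT + r"+)*"
QTEXT = r"[ !#-\[\]-~]"
QUOTED_PAIR = r"\\[ -~]"
QUOTED_STRING = '"(?:' + QTEXT + "|" + QUOTED_PAIR + ')*"'
LOCAL_PART = "(?:" + DOT_STRING + "|" + QUOTED_STRING + ")"

# Domain: labels of at most 63 LDH characters, not shaped like IPv4.
DOMAIN_CHAR = r"[\-.0-9A-Za-z]"
OCT = r"0*(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]?)"
IPV4 = OCT + r"(?:\." + OCT + r"){0,3}"
SUB_DOMAIN = r"[0-9A-Za-z](?:[\-0-9A-Za-z]{0,61}[0-9A-Za-z])?"
DOMAIN = (
    "(?!" + IPV4 + "(?!" + DOMAIN_CHAR + "))"  # reject IPv4-shaped names
    + SUB_DOMAIN + r"(?:\." + SUB_DOMAIN + ")*"
    + "(?!" + DOMAIN_CHAR + ")"                # no trailing domain characters
)

# The fixed-width lookbehind fails when 254 domain characters precede the end.
ADDRESS = re.compile(
    LOCAL_PART + "@" + DOMAIN + "(?<!" + DOMAIN_CHAR + "{254})"
)


def is_deliverable(candidate: str) -> bool:
    """Return True if the whole string matches the ASCII pattern."""
    return ADDRESS.fullmatch(candidate) is not None
```

   Using fullmatch() plays the role of the ^...$ anchors; search()
   would correspond to the unanchored forms.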

3.1.2.  Deliverable Email Address

   A deliverable email address matches the <Mailbox> production of
   [RFC5321].

   (?&local_part)@(?&domain)

   (?&local_part)@(?&domain)(?<![\-.0-9A-Za-z]{254})

   ^(?&local_part)@(?&domain)$

   ^(?&local_part)@(?&domain)(?<![\-.0-9A-Za-z]{254})$

   The aforementioned patterns take into account the limitations of DNS
   [RFC1034], namely, that a label can only have 63 characters, and that
   a domain name can only have 253 characters.

   The negative-lookbehind on the domain (with {254}) essentially
   ensures that the "@" symbol appears no more than 253 characters
   before the end of the domain.  The lookbehind has a fixed length
   because PCRE2 cannot process variable-length lookbehinds.
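
   The effect of the fixed-width lookbehind can be seen in isolation.
   The sketch below is illustrative only: it uses Python's "re" module
   (an assumption; fixed-width lookbehinds behave the same way there),
   a deliberately simplified stand-in for <domain>, and the hypothetical
   names DOMAIN_WITH_LIMIT and within_domain_limit.

```python
import re

# Simplified stand-in for <domain>: a run of domain characters at the end
# of the string, followed by the same fixed-width negative lookbehind.
DOMAIN_WITH_LIMIT = re.compile(
    r"@[\-.0-9A-Za-z]+(?<![\-.0-9A-Za-z]{254})\Z"
)


def within_domain_limit(address: str) -> bool:
    """Return True if 1 to 253 domain characters follow an "@" at the end."""
    return DOMAIN_WITH_LIMIT.search(address) is not None
```

   With 253 characters after the "@", the 254-character window reaches
   back onto the "@" itself, so the lookbehind cannot match and the
   address is accepted; with 254 characters the window is filled
   entirely by domain characters and the match is rejected.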

   The repeater {0,61} in <sub_domain> is not sufficient to ensure a
   safe <domain> production: the domain production might end after 63
   consumed characters in a putative sub-domain, leaving the excess
   characters of that putative sub-domain hanging at the end.  (This is
   not an issue when the entire string is an address, in which case $
   definitively terminates the string.)  The way to deal with this is
   the negative-lookahead at the end of <domain>, which fails to match
   when additional domain characters are present.  This negative-
   lookahead also deals with some unpleasant corner cases when the
   <domain> is a partial IPv4 address.
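
   The truncation problem described above can be reproduced with a
   short experiment.  The sketch assumes Python's "re" module and uses
   the hypothetical names UNGUARDED and GUARDED for the domain pattern
   without and with the trailing negative-lookahead; the sample text is
   invented for illustration.

```python
import re

SUB_DOMAIN = r"[0-9A-Za-z](?:[\-0-9A-Za-z]{0,61}[0-9A-Za-z])?"
UNGUARDED = re.compile("@(" + SUB_DOMAIN + r"(?:\." + SUB_DOMAIN + ")*)")
GUARDED = re.compile(
    "@(" + SUB_DOMAIN + r"(?:\." + SUB_DOMAIN + r")*)(?![\-.0-9A-Za-z])"
)

# A 70-character label is too long for DNS.
text = "mail user@" + "a" * 70 + ".example for details"

# Without the lookahead, the match silently stops inside the overlong label.
truncated = UNGUARDED.search(text).group(1)

# With the lookahead, no domain is reported at all.
guarded = GUARDED.search(text)
```

   Here "truncated" holds only the first 63 characters of the label,
   whereas "guarded" is None, which is the safer outcome when scanning
   free text.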

   The aforementioned patterns also take into account [RFC1912], namely,
   "[l]abels may not be all numbers".  Actually the way that software
   works is that if the putative domain name can be parsed as an IPv4
   address, then it is treated as an IPv4 address; otherwise, it is
   treated as a DNS name.  Therefore, character sequences that are valid
   IPv4 addresses need to be restricted out.  For example: "1.3.4.255"
   is invalid; but, "1.3.4.256" and "1.3.4.255.1" are valid, because
   they do not parse to IPv4 addresses. (411.org is valid for obvious
   reasons.)
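
   The parsing behavior described above can be modeled in a few lines.
   The sketch below is a simplified model of the classic BSD-style
   inet_aton() rules (one to four components, each in decimal, octal, or
   hexadecimal, with the last component filling the remaining octets).
   That this is the behavior the text has in mind is an assumption, and
   the function names are hypothetical.

```python
from typing import Optional


def _parse_number(part: str) -> Optional[int]:
    """Parse one component as hex (0x...), octal (leading 0), or decimal."""
    lowered = part.lower()
    if lowered.startswith("0x"):
        digits, base = lowered[2:], 16
    elif len(lowered) > 1 and lowered[0] == "0":
        digits, base = lowered[1:], 8
    else:
        digits, base = lowered, 10
    allowed = "0123456789abcdef"[:base]
    if not digits or any(ch not in allowed for ch in digits):
        return None
    return int(digits, base)


def classic_ipv4(text: str) -> Optional[str]:
    """Return the dotted quad a BSD-style parser would derive, or None."""
    parts = text.split(".")
    if not 1 <= len(parts) <= 4:
        return None
    values = [_parse_number(part) for part in parts]
    if any(value is None for value in values):
        return None
    *head, last = values
    # Leading components are single octets; the last fills the remaining bytes.
    if any(value > 255 for value in head) or last >= 256 ** (4 - len(head)):
        return None
    number = last
    for index, value in enumerate(head):
        number |= value << (8 * (3 - index))
    return ".".join(str((number >> shift) & 255) for shift in (24, 16, 8, 0))
```

   Under this model a putative domain is treated as a DNS name exactly
   when classic_ipv4() returns None.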

   Borderline cases are partial IPv4 addresses, such as "411" (could be
   0.0.1.155) and "1.411" (could be 1.0.1.155).  The regular expressions
   above accept these domain productions, but they may not be safe.  The
   regular expressions above also accept hex form, such as "0xef" (could
   be 0.0.0.239).

3.1.3.  (Leftover from draft-00) Basic Rules of Derivation (Unicode)

   The following rules are amalgamated from the SMTP standards [RFC5321]
   and [RFC6531], the foundational DNS standards [RFC1034], [RFC1035],
   and [RFC1123], and the modern IDNA standards [RFC5890], [RFC5891],
   and [RFC5892].

   [[NB: The syntax below assumes that Perl Compatible Regular
   Expressions 2 [PCRE2] is being used, such that \xnn and \x{...} refer
   to valid Unicode scalar values, i.e., well-formed UTF-8 sequences.
   Specifically, the surrogate range U+D800-U+DFFF is omitted.  Although
   listed as JSRE and ERE-compatible, these expressions will need to be
   massaged somewhat to handle the Unicode-referencing differences.]]

   [[TODO: There need to be two forms: a form that matches a complete
   string, so the regular expression can start with ^ and end with $.
   This will make the execution pretty fast.  A second form can match
   any string in a block of text.  This will be much more intensive
   because ^ and $ cannot be used; instead the boundary could be
   delimited by any of a HUGE yet noncontiguous quantity of characters
   beyond ASCII, such as fullwidth punctuation, spaces of various kinds,
   etc.]]

P E J atext = [A-Za-z0-9!#-'*+\-/=?^_`{-~\xA0-\x{10FFFF}]

P E J dot-string = [A-Za-z0-9!#-'*+\-/=?^_`{-~\xA0-\x{10FFFF}]+
                   (?:\.[A-Za-z0-9!#-'*+\-/=?^_`{-~\xA0-\x{10FFFF}]+)*

P E J qtext = [ !#-\[\]-~\xA0-\x{10FFFF}]

P E J quoted-pair = \\[ -~]

P E J qcontent = [ !#-\[\]-~\xA0-\x{10FFFF}]|\\[ -~]

P E J quoted-string = "(?:[ !#-\[\]-~\xA0-\x{10FFFF}]|\\[ -~])*"

P E J local-part = [A-Za-z0-9!#-'*+\-/=?^_`{-~\xA0-\x{10FFFF}]+
             (?:\.[A-Za-z0-9!#-'*+\-/=?^_`{-~\xA0-\x{10FFFF}]+)*|
             "(?:[ !#-\[\]-~\xA0-\x{10FFFF}]|\\[ -~])*"

; RFC 5890, must contain at least one non-ASCII character
; TODO: express other constraints such as in Protocol document [RFC5891]
; and Tables document [RFC5892]
; WARNING: intentional omission: dot . is included in this production--
; it should not be
      U-label = [[TOO COMPLEX TO DO RIGHT NOW...]]
P E J U-label = [\x00-\x{10FFFF}]*[\x80-\x{10FFFF}]+[\x00-\x{10FFFF}]*

P E J sub-domain = [A-Za-z0-9](?:[A-Za-z0-9\-]*[A-Za-z0-9])?|
                   [\x00-\x{10FFFF}]*[\x80-\x{10FFFF}]+
                   [\x00-\x{10FFFF}]*

P E J domain = (?:[A-Za-z0-9](?:[A-Za-z0-9\-]*[A-Za-z0-9])?|
               [\x00-\x{10FFFF}]*[\x80-\x{10FFFF}]+[\x00-\x{10FFFF}]*)
               (?:\.(?:[A-Za-z0-9](?:[A-Za-z0-9\-]*
               [A-Za-z0-9])?|[\x00-\x{10FFFF}]*
               [\x80-\x{10FFFF}]+[\x00-\x{10FFFF}]*))*

3.1.4.  Complete Expression for Deliverable Email Address

   The following regular expression is a deliverable email address:

; Mailbox from RFC 5321, as amended
P E J DEA     = ([A-Za-z0-9!#-'*+\-/=?^_`{-~\xA0-\x{10FFFF}]+
                 (?:\.[A-Za-z0-9!#-'*+\-/=?^_`{-~\xA0-\x{10FFFF}]+)*|
                 "(?:[ !#-\[\]-~\xA0-\x{10FFFF}]|\\[ -~])*")@
                ((?:[A-Za-z0-9](?:[A-Za-z0-9\-]*[A-Za-z0-9])?|
                 [\x00-\x{10FFFF}]*[\x80-\x{10FFFF}]+[\x00-\x{10FFFF}]*)
                 (?:\.(?:[A-Za-z0-9](?:[A-Za-z0-9\-]*
                 [A-Za-z0-9])?|[\x00-\x{10FFFF}]*
                 [\x80-\x{10FFFF}]+[\x00-\x{10FFFF}]*))*)

   In the regular expression DEA, capturing group 1 is the local-part
   production, and capturing group 2 is the domain production.

3.1.5.  Using Character Classes

   [[TODO: provide expressions that use character classes, and explain
   the benefits and tradeoffs.]]

3.1.6.  "Flotsam" and "Jetsam" Beyond ASCII

   As mail usage is international in scope, modern mail and mail
   identifier-using systems MUST support Unicode EAI identifiers.
   Unfortunately, rigorously following the EAI specifications [RFC6530],
   [RFC6531], and [RFC6532] will lead to (possibly) unforeseen text
   parsing problems, where naive (or strictly conforming) parsers will
   tend both to overconsume and underconsume non-ASCII text surrounding
   an otherwise "obvious" e-mail address.  The problem is described in
   this section, while Section 3.1.7 provides at least some partial
   mitigations.

   The right-hand side of a deliverable email address is a domain name.
   A conforming parser may well overconsume text on the right-hand side,
   aka "jetsam", that cannot possibly be in a domain name, such as non-
   ASCII punctuation, spaces, control characters, and noncharacter code
   points.  On the left-hand side (local-part), for better or for worse,
   the atext production of local-part has been extended in both
   [RFC6531] and [RFC6532] to accept any Unicode character beyond the
   ASCII range.  Therefore, a whole slew of "flotsam" can get validly
   prepended to an internationalized email address.  Apparently the only
   way to "stop" this is to quote the local-part.

   Characters will get inconsistent treatment depending on which end the
   characters appear.  For example: the characters U+FF1C FULLWIDTH
   LESS-THAN SIGN and U+FF1E FULLWIDTH GREATER-THAN SIGN are classified
   as [Sm] aka "Symbol, Math", a category that is DISALLOWED under
   [RFC5892].
   Therefore the presence of U+FF1E on the domain end will terminate the
   address (for a properly implemented regular expression), but the
   corresponding presence of U+FF1C on the local-part end will NOT
   terminate the address!  A conforming parser would actually start much
   earlier in a blob of text (possibly as early as the beginning of a
   new line) and match all characters up to the @ delimiter, blowing
   straight through U+FF1C.
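
   The asymmetry can be demonstrated directly.  The sketch below
   assumes Python's "re" module, the extended atext range used by the
   Unicode expressions in this document, and a deliberately simple
   ASCII-only stand-in for the domain; all names and the sample text
   are hypothetical.

```python
import re

# atext extended beyond ASCII (U+00A0..U+10FFFF), as in the Unicode rules.
ATEXT_EAI = r"[A-Za-z0-9!#-'*+\-/=?^_`{-~\u00A0-\U0010FFFF]"
DOT_STRING_EAI = ATEXT_EAI + r"+(?:\." + ATEXT_EAI + r"+)*"
# Assumption: a simple ASCII-only stand-in for the domain production.
SIMPLE_DOMAIN = r"[0-9A-Za-z](?:[\-.0-9A-Za-z]*[0-9A-Za-z])?"
ADDRESS = re.compile("(" + DOT_STRING_EAI + ")@(" + SIMPLE_DOMAIN + ")")

# "Contact" + U+FF1C + "alice@example.com" + U+FF1E + " today"
text = "Contact\uFF1Calice@example.com\uFF1E today"
match = ADDRESS.search(text)
local_part, domain = match.group(1), match.group(2)
```

   The domain stops cleanly before U+FF1E, but the local-part swallows
   both U+FF1C and the preceding word "Contact".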

3.1.7.  Certain Expressions for Restrictions

   Email addresses incorporate internationalized domain names, so the
   complex and confusing rules of IDNs apply directly to the right-hand
   side of deliverable email addresses.

   [[TODO: integrate these expressions into the main expressions of
   3.1.2.]] The following expressions apply to an individual sub-domain
   production:

   [[NB: open issue if length should be restricted.  Author believes it
   should be length-restricted, because overlong labels in domain names
   mean the address can't be looked up, and therefore, the address is
   not deliverable.]]

   The following lookahead regular expressions apply on a per-label
   (i.e., per-sub-domain) basis.

 P E J (?=[^.]{1,63}(?:\.|$))
         restricts to 63 characters

 P E J (?=[\-0-9A-Za-z]{1,63}(?:\.|$))
         restricts to 63 LDH characters

 P E J (?=[\-0-9A-Za-z\xA0-\x{10FFFF}]{1,56}(?:\.|$))(?!..--)
       (?=[\-0-9A-Za-z\xA0-\x{10FFFF}]{0,55}[\xA0-\x{10FFFF}]
          [\-0-9A-Za-z]{0,55}(?:\.|$))
       or
       (?=[\-0-9A-Za-z\xA0-\x{10FFFF}]{1,56}(?:\.|$))(?!..--)
       (?![\-0-9A-Za-z]{1,56}(?:\.|$))
         Enforces that for U-labels (where at least one non-ASCII char
         is present), there cannot be more than 55 chars. The equivalent
         ACE will include xn-- PLUS -2rh, which means 7 extra characters
         (8 chars minus 1 Unicode char). Did a test with U+00DE.
         Also not allowed to have HYPHEN HYPHEN in the third and fourth
         positions (Section 4.2.3.1 of [RFC5891]).
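
   The amount by which a U-label grows when converted to its ACE form
   depends on the characters involved, so an implementation may prefer
   to encode the label and measure the result rather than estimate it.
   The sketch below uses the "punycode" codec from Python's standard
   library; the function names are hypothetical, and none of the
   validity rules of [RFC5891] or [RFC5892] are checked.

```python
def a_label(u_label: str) -> str:
    """Return "xn--" plus the Punycode form of a U-label (no validation)."""
    return "xn--" + u_label.encode("punycode").decode("ascii")


def fits_in_dns_label(u_label: str) -> bool:
    """Return True if the encoded A-label is at most 63 octets long."""
    return len(a_label(u_label)) <= 63
```

   For example, a_label("b\u00fccher") yields "xn--bcher-kva", which is
   13 octets for a 6-character U-label.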

   [[NB: Should we permit fullwidth and other dots (not just ASCII
   dot)?]]

   P E J (?![0-9]{1,63}(?:\.|$))
           restricts out all-numeric labels [RFC1912]
           [[TODO: Would it be more accurate to say that the all-numeric
           labels 0-255 are prohibited, but 256+ are permissible?]]

   [[NB: The end-of-address production \.|$ must also include the
   possibility of any number of non-domain name characters, when
   searching through an arbitrary block of text.]]

   [[TODO: the flotsam on the local-part end is potentially a big
   problem, unless we say that deliverable email addresses MUST be
   delimited on the left-hand side by the ASCII character < or other
   well-established characters that are not in dot-atom-text/dot-
   string.]]

3.1.8.  Unquoting Local-Part

   The following regular expressions can be used to unquote the local-
   part production.

   P sed s/^"(.*)"$/$1/
           Applies when the local-part is isolated from @domain.
           Removes surrounding quotations.

   P sed s/\\(.)/$1/g
           Applies when the local-part is isolated from the surrounding
           quotations. (Safe to use with dot-string aka dot-atom-text,
           since backslashes are not present.) Unquotes quoted-pairs.

   P   J /^"|(?=.+"$)\\(.)|"$/g  (with "$1" replacement string)
   P sed s/^"|(?=.+"$)\\(.)|"$/$1/g
           Applies when local-part is isolated from @domain.
           Removes surrounding quotations and unquotes quoted-pairs
           in one loop by using lookahead.
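
   The same two steps (strip the surrounding quotation marks, then
   unquote each quoted-pair) can be written as a small function.  This
   is a sketch in Python, with the hypothetical name unquote_local_part;
   it assumes the local-part has already been isolated from @domain and
   validated.

```python
import re


def unquote_local_part(local_part: str) -> str:
    """Strip surrounding quotation marks and unquote quoted-pairs."""
    quoted = (
        len(local_part) >= 2
        and local_part[0] == '"'
        and local_part[-1] == '"'
    )
    if not quoted:
        # A dot-string contains no backslashes, so it is returned as is.
        return local_part
    inner = local_part[1:-1]
    # Replace each backslash-plus-character pair with the character itself.
    return re.sub(r"\\(.)", r"\1", inner)
```

   Processing the pairs left to right means an escaped backslash is
   consumed as a unit and never mistaken for the start of another pair.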

   [[TODO: add more detailed descriptions of operations.]]

3.1.9.  Quoting Local-Part

   Given a Unicode string that represents the unquoted local-part, the
   following regular expressions can be used to create a quoted
   production.

   J /(?=[^"\\]*["\\])(["\\])/\\$1/g
           Quote each and every " and \.
   J /^(?![A-Za-z0-9!#-'*+\-/=?^_`{-~\xA0-\x{10FFFF}]+
         (?:\.[A-Za-z0-9!#-'*+\-/=?^_`{-~\xA0-\x{10FFFF}]+)*$).*$/"$&"/
           If the string does not conform to dot-string (including,
           e.g., the presence of consecutive dots), surround the
           entire string with quotations.
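
   The two substitutions can also be combined into one function.  The
   sketch below is in Python with the hypothetical name
   quote_local_part; it assumes the input contains only characters that
   are permitted inside a quoted-string (for example, no control
   characters), which it does not verify.

```python
import re

ATEXT = r"[A-Za-z0-9!#-'*+\-/=?^_`{-~\u00A0-\U0010FFFF]"
DOT_STRING = re.compile(ATEXT + r"+(?:\." + ATEXT + r"+)*")


def quote_local_part(raw: str) -> str:
    """Return raw unchanged if it is a dot-string, else a quoted-string."""
    if DOT_STRING.fullmatch(raw):
        return raw
    # Prefix every quotation mark and backslash with a backslash.
    escaped = re.sub(r'(["\\])', r"\\\1", raw)
    return '"' + escaped + '"'
```

   Unlike the two-step form, this sketch escapes characters only when
   quotation is actually needed, so a plain dot-string passes through
   untouched.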

3.2.  Modern Email Address

   A modern email address is an email address that conforms to
   [RFC5322], except for the ABNF productions in that standard that are
   marked as "obsolete".  For example, control characters are excluded.
   Modern email addresses are a superset of deliverable email addresses
   [RFC5321].  Since modern email addresses are not necessarily
   deliverable by SMTP, the domain production does not need to conform
   to DNS rules.  This relaxation makes the regular expressions much
   simpler.  On the other hand, modern email addresses permit embedded
   comments and folding whitespace, requiring the use of a pushdown
   automaton.

(?(DEFINE)
 (?#whitespace)
 (?<FWS>(?:[\t ]*\r\n)?[\t ]+)
 (?<CFWS>(?:(?&FWS)?(?&comment))+(?&FWS)?|(?&FWS))
 (?#local)
 (?<atext>[0-9A-Za-z!#-'*+\-/=?^_`{-~])
 (?<dot_atom_text>(?&atext)+(?:\.(?&atext)+)*)
 (?<ctext>[!-'*-\[\]-~])
 (?<ccontent>(?&ctext)|(?&quoted_pair)|(?&comment))
 (?<comment>\((?:(?&FWS)?(?&ccontent))*(?&FWS)?\))
 (?<qtext>[ !#-\[\]-~])
 (?<quoted_pair>\\[ -~])
 (?<qcontent>(?&qtext)|(?&quoted_pair))
 (?<quoted_string>(?&CFWS)?"(?:(?&FWS)?(?&qcontent))*(?&FWS)?"(?&CFWS)?)
 (?<local_part>(?&CFWS)?(?&dot_atom_text)(?&CFWS)?|(?&quoted_string))
 (?#domain)
 (?<dtext>[!-Z^-~])
 (?<domain>(?&CFWS)?(?:(?&dot_atom_text)|
  \[(?:(?&FWS)?(?&dtext))*(?&FWS)?\])(?&CFWS)?)
)
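
   The <comment> production above refers to itself through <ccontent>,
   which is what requires recursion support in the regular expression
   engine.  Where such support is unavailable (Python's "re" module,
   for instance, has no subroutine calls), nested comments can be
   skipped with an explicit depth counter.  The sketch below is
   illustrative only: skip_comment is a hypothetical name, and the
   function checks nesting and quoted-pairs but does not validate that
   every character is a legal ctext.

```python
def skip_comment(text: str, start: int) -> int:
    """Return the index just past the comment at text[start], or -1."""
    if start >= len(text) or text[start] != "(":
        return -1
    depth = 0
    i = start
    while i < len(text):
        ch = text[i]
        if ch == "\\":
            if i + 1 >= len(text):
                return -1  # dangling backslash
            i += 2  # quoted-pair: skip the escaped character as well
            continue
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth == 0:
                return i + 1
        i += 1
    return -1  # ran out of input before the comment closed
```

   The depth counter plays the role of the pushdown automaton's stack;
   because only one kind of bracket is tracked, a single integer is
   enough.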

   A modern email address matches the <addr-spec> production of
   [RFC5322], without any obsolete parts.

   (?&local_part)@(?&domain)

   ^(?&local_part)@(?&domain)$

3.3.  Legacy Email Address

   [[TODO: expand regular expressions.  Arguably, characters beyond
   ASCII need not be included.  Therefore domain should be MUCH simpler:
   the name form should be restricted to LDH-strings.]]

3.4.  Algorithms for Detecting Email Addresses

   As Section 3.1 indicates, compiling and executing a true and correct
   regular expression for an email address (deliverable, modern, or
   legacy) will be complicated and time-consuming.  More efficient
   algorithms are desirable.

   o  Scan text buffer for "@".

   o  Evaluate characters after "@" for domain production.

   o  Evaluate character prior to "@".

   o  If prior character is <">, scan backwards for initial (unescaped)
      <">; evaluate characters in between to match quoted-string
      production.

   o  Otherwise, if the prior character is a valid atext, consume
      characters backwards while evaluating for dot-string.  (The
      dot-string production is palindromic: it matches the same way
      whether it is read left-to-right or right-to-left.)
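
   The steps above can be sketched as follows.  This is a minimal
   ASCII-only illustration in Python, not a complete detector: the
   function names are hypothetical, the domain check omits the IPv4 and
   total-length restrictions, and the backward scan for a quoted-string
   simply tries each unescaped quotation mark until one yields a valid
   production.

```python
import re
from typing import List, Optional

ATEXT = set(
    "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
    "!#$%&'*+-/=?^_`{|}~"
)
SUB_DOMAIN = r"[0-9A-Za-z](?:[\-0-9A-Za-z]{0,61}[0-9A-Za-z])?"
DOMAIN = re.compile(
    SUB_DOMAIN + r"(?:\." + SUB_DOMAIN + r")*(?![\-.0-9A-Za-z])"
)
QUOTED_STRING = re.compile(r'"(?:[ !#-\[\]-~]|\\[ -~])*"')


def _dot_string_start(text: str, at: int) -> Optional[int]:
    """Scan backwards from "@" and return where the dot-string begins."""
    if at == 0 or text[at - 1] not in ATEXT:
        return None
    i = at
    while i > 0:
        ch = text[i - 1]
        if ch in ATEXT:
            i -= 1
        elif ch == "." and i >= 2 and text[i - 2] in ATEXT:
            i -= 1  # a dot is consumed only when atext sits on both sides
        else:
            break
    return i


def _quoted_string_start(text: str, at: int) -> Optional[int]:
    """Scan backwards for an unescaped opening quotation mark."""
    j = at - 2
    while j >= 0:
        if text[j] == '"':
            k = j
            while k > 0 and text[k - 1] == "\\":
                k -= 1
            # An even number of preceding backslashes: the mark is unescaped.
            if (j - k) % 2 == 0 and QUOTED_STRING.fullmatch(text, j, at):
                return j
        j -= 1
    return None


def find_addresses(text: str) -> List[str]:
    """Return candidate ASCII addresses found around each "@" in text."""
    found = []
    at = text.find("@")
    while at != -1:
        domain = DOMAIN.match(text, at + 1)
        if domain is not None and at > 0:
            if text[at - 1] == '"':
                start = _quoted_string_start(text, at)
            else:
                start = _dot_string_start(text, at)
            if start is not None:
                found.append(text[start:domain.end()])
        at = text.find("@", at + 1)
    return found
```

   Because the scan is driven by occurrences of "@", text that contains
   no "@" is never examined by the regular expression engine at all.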

   [[TODO: add other suggested algorithms.]]

   [[Splitting valid email address into local-part and right-hand-
   side.]]

3.5.  Handling Domain Names

   [[TODO: discuss the issues with handling domain names.]]

4.  Regular Expressions for Message-IDs

   Message-ID values form a disjoint set from email address values,
   i.e., a Message-ID that also happens to be an email address is just a
   coincidence.

   The productions that comprise Message-ID are called id-left and id-
   right.

4.1.  Modern Message-ID

   A modern Message-ID is one that complies with the strict generation
   rules of [RFC5322].  In particular: id-left is only dot-atom-text (as
   amended by [RFC6532]), and id-right is dot-atom-text or no-fold-
   literal (as amended by [RFC6532]).  Notably, virtually any Unicode
   scalar value is permissible in id-right, because [RFC6532] does not
   import U-label (unlike [RFC6531]).  The resulting regular expressions
   will therefore be more expansive, at the cost of accepting
   characters, such as fullwidth punctuation, that would otherwise
   delimit Message-IDs on both ends in text.

   The regular expressions reuse many of the subroutines of Section 3.1.
   [[POINTER: obs-id-left and obs-id-right are supersets of their modern
   forms, so deliverable email address regular expressions may well be
   reused directly.]]

   (?(DEFINE)
    (?#id-left)
    (?<atext>[0-9A-Za-z!#-'*+\-/=?^_`{-~])
    (?<dot_atom_text>(?&atext)+(?:\.(?&atext)+)*)
    (?<id_left>(?&dot_atom_text))
    (?#id-right)
    (?<dtext>[!-Z^-~])
    (?<id_right>(?&dot_atom_text)|\[(?&dtext)*\])
   )
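
   For engines without (?(DEFINE)...) support, the same productions can
   be assembled by concatenation.  The sketch below targets Python's
   "re" module (an assumption), is ASCII-only like the block above, and
   expects the Message-ID without its enclosing angle brackets; the
   names MESSAGE_ID and parse_message_id are hypothetical.

```python
import re
from typing import Optional, Tuple

ATEXT = r"[0-9A-Za-z!#-'*+\-/=?^_`{-~]"
DOT_ATOM_TEXT = ATEXT + r"+(?:\." + ATEXT + r"+)*"
DTEXT = r"[!-Z^-~]"
MESSAGE_ID = re.compile(
    "(" + DOT_ATOM_TEXT + ")@(" + DOT_ATOM_TEXT + r"|\[" + DTEXT + r"*\])"
)


def parse_message_id(value: str) -> Optional[Tuple[str, str]]:
    """Split a Message-ID into (id-left, id-right), or return None."""
    match = MESSAGE_ID.fullmatch(value)
    if match is None:
        return None
    return match.group(1), match.group(2)
```

   Two Message-IDs can then be compared by comparing the returned
   tuples, in line with the equivalence rules of Section 2.5.2.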

   A modern Message-ID matches the id-left "@" id-right portion of the
   <msg-id> production of [RFC5322] (i.e., without the enclosing angle
   brackets), without any obsolete parts.

   (?&id_left)@(?&id_right)

   ^(?&id_left)@(?&id_right)$

4.2.  General Message-ID

   A "general Message-ID" is one that complies with any of the mail
   rules.

   [[TODO: complete.]]

5.  Security Considerations

   Internet Mail identifiers are important for identifying users and
   other principals in security systems.  While a user's login token and
   an email address are formally separate entities, many common
   Internet-connected systems conflate the two.  Systems that accept
   email addresses as login tokens (in particular, other systems' email
   addresses, rather than their own) SHALL accept the full range of
   valid email addresses.  To do otherwise is to act as a denial of
   service against legitimate users with legitimate mailbox names.

   When a user forgets his or her password or other login credentials,
   the most common recovery method on the Internet is to send a recovery
   message to the user's registered [[email address]].  Preventing users
   from using their chosen or assigned addresses acts as a denial of
   service.

   Because a local-part can contain almost any Unicode scalar value,
   security issues are essentially pushed from clients to servers and to
   registration processes.  I.e., it is up to a server implementation to
   decide whether to accept an arbitrary Unicode string for registration
   or for delivery with SMTP, folding or normalizing input at its
   discretion.  A robust server implementation needs to handle arbitrary
   input gracefully.

   In contrast, when a domain part represents a domain name, the string
   is severely restricted by the IDNA documents [RFC5890] et seq.  A
   server is perfectly within its rights to reject input that is not in
   NFC or contains disallowed characters.  Specifically, since the
   domain part is the key to retrieving the MX resource record, the
   Internet Mail standards hardly get involved.  These restrictions put
   more onus on clients to validate strings.  However, integrating the
   entire static list of [RFC5892] into regular expressions would be
   unduly burdensome on many implementations.  While an implementation
   can consider using character classes, the risks and benefits of using
   character classes need to be carefully considered.

   Character classes represent a shorthand for certain ranges of
   characters based on Unicode properties.  Effectively the total
   system's state table remains the same, but the complexity is pushed
   from one component (the regular expression definition) to another
   component (the table representing the character classes).
   Ultimately, the regular expression character classes in popular
   formulations will derive from the Unicode Standard [UNICODE]
   definitions such as UnicodeData.txt.  Since a proper IDNA-enabled DNS
   library needs to keep track of these character classes anyway,
   referencing these ranges by character classes should not add much to
   the image size.  However, the regular expression (which previously
   was self-contained) now has an external dependency on data that may
   (and probably will) change frequently, including (for example) the
   ranges of unassigned code points.  What could have been a valid
   email address one month may be invalid the next month, or vice-versa,
   simply depending on the version of the regular expression or DNS
   library that an implementation depends on.  Implementers should
   therefore evaluate their own needs for security and stability in
   picking particular regular expression forms.

6.  References

6.1.  Normative References

   [I-D.seantek-abnf-more-core-rules]
              Leonard, S., "Comprehensive Core Rules and References for
              ABNF", draft-seantek-abnf-more-core-rules-07 (work in
              progress), September 2016.

   [JSRE6ED]  Ecma International, "ECMAScript 2015 Language
              Specification", Standard ECMA-262, 6th Edition, June
              2015, <http://www.ecma-international.org/ecma-262/6.0/>.

   [PCRE2]    Hazel, P., "Perl Compatible Regular Expressions 2, version
              10.21", January 2016, <http://www.pcre.org/>.

   [POS-ERE]  IEEE Std 1003.1, 2013 Edition (incorporates IEEE Std
              1003.1-2008 and IEEE Std 1003.1-2008/Cor 1-2013),
              "Standard for Information Technology - Portable Operating
              System Interface (POSIX(R)) Base Specifications, Issue 7
              (incorporating Technical Corrigendum 1), Section 9.4,
              Extended Regular Expressions", April 2013,
              <http://pubs.opengroup.org/onlinepubs/9699919799/
              basedefs/V1_chap09.html>.

   [RFC0821]  Postel, J., "Simple Mail Transfer Protocol", STD 10, RFC
              821, DOI 10.17487/RFC0821, August 1982,
              <http://www.rfc-editor.org/info/rfc821>.

   [RFC0822]  Crocker, D., "STANDARD FOR THE FORMAT OF ARPA INTERNET
              TEXT MESSAGES", STD 11, RFC 822, DOI 10.17487/RFC0822,
              August 1982, <http://www.rfc-editor.org/info/rfc822>.

   [RFC0973]  Mockapetris, P., "Domain system changes and observations",
              RFC 973, DOI 10.17487/RFC0973, January 1986,
              <http://www.rfc-editor.org/info/rfc973>.

   [RFC1034]  Mockapetris, P., "Domain names - concepts and facilities",
              STD 13, RFC 1034, DOI 10.17487/RFC1034, November 1987,
              <http://www.rfc-editor.org/info/rfc1034>.

   [RFC1035]  Mockapetris, P., "Domain names - implementation and
              specification", STD 13, RFC 1035, DOI 10.17487/RFC1035,
              November 1987, <http://www.rfc-editor.org/info/rfc1035>.

   [RFC1123]  Braden, R., Ed., "Requirements for Internet Hosts -
              Application and Support", STD 3, RFC 1123, DOI 10.17487/
              RFC1123, October 1989,
              <http://www.rfc-editor.org/info/rfc1123>.

   [RFC2045]  Freed, N. and N. Borenstein, "Multipurpose Internet Mail
              Extensions (MIME) Part One: Format of Internet Message
              Bodies", RFC 2045, DOI 10.17487/RFC2045, November 1996,
              <http://www.rfc-editor.org/info/rfc2045>.

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

   [RFC2821]  Klensin, J., Ed., "Simple Mail Transfer Protocol", RFC
              2821, DOI 10.17487/RFC2821, April 2001,
              <http://www.rfc-editor.org/info/rfc2821>.

   [RFC2822]  Resnick, P., Ed., "Internet Message Format", RFC 2822, DOI
              10.17487/RFC2822, April 2001,
              <http://www.rfc-editor.org/info/rfc2822>.

   [RFC5321]  Klensin, J., "Simple Mail Transfer Protocol", RFC 5321,
              DOI 10.17487/RFC5321, October 2008,
              <http://www.rfc-editor.org/info/rfc5321>.

   [RFC5322]  Resnick, P., Ed., "Internet Message Format", RFC 5322,
              October 2008.

   [RFC5890]  Klensin, J., "Internationalized Domain Names for
              Applications (IDNA): Definitions and Document Framework",
              RFC 5890, DOI 10.17487/RFC5890, August 2010,
              <http://www.rfc-editor.org/info/rfc5890>.

   [RFC5891]  Klensin, J., "Internationalized Domain Names in
              Applications (IDNA): Protocol", RFC 5891, DOI 10.17487/
              RFC5891, August 2010,
              <http://www.rfc-editor.org/info/rfc5891>.

   [RFC5892]  Faltstrom, P., Ed., "The Unicode Code Points and
              Internationalized Domain Names for Applications (IDNA)",
              RFC 5892, DOI 10.17487/RFC5892, August 2010,
              <http://www.rfc-editor.org/info/rfc5892>.

   [RFC6530]  Klensin, J. and Y. Ko, "Overview and Framework for
              Internationalized Email", RFC 6530, DOI 10.17487/RFC6530,
              February 2012, <http://www.rfc-editor.org/info/rfc6530>.

   [RFC6531]  Yao, J. and W. Mao, "SMTP Extension for Internationalized
              Email", RFC 6531, DOI 10.17487/RFC6531, February 2012,
              <http://www.rfc-editor.org/info/rfc6531>.

   [RFC6532]  Yang, A., Steele, S., and N. Freed, "Internationalized
              Email Headers", RFC 6532, February 2012.

   [UNICODE]  The Unicode Consortium, "The Unicode Standard, Version
              9.0.0", August 2016.

6.2.  Informative References

   [MTECHDEV]
              Partridge, C., "The Technical Development of Internet
              Email", IEEE Annals of the History of Computing, Vol. 30,
              No. 2, DOI 10.1109/MAHC.2008.32 , June 2008.

   [RFC0524]  White, J., "Proposed Mail Protocol", RFC 524, DOI
              10.17487/RFC0524, June 1973,
              <http://www.rfc-editor.org/info/rfc524>.

   [RFC0561]  Bhushan, A., Pogran, K., Tomlinson, R., and J. White,
              "Standardizing Network Mail Headers", RFC 561, DOI
              10.17487/RFC0561, September 1973,
              <http://www.rfc-editor.org/info/rfc561>.

   [RFC0680]  Myer, T. and D. Henderson, "Message Transmission
              Protocol", RFC 680, DOI 10.17487/RFC0680, April 1975,
              <http://www.rfc-editor.org/info/rfc680>.

   [RFC0724]  Crocker, D., Pogran, K., Vittal, J., and D. Henderson,
              "Proposed official standard for the format of ARPA Network
              messages", RFC 724, DOI 10.17487/RFC0724, May 1977,
              <http://www.rfc-editor.org/info/rfc724>.

   [RFC0733]  Crocker, D., Vittal, J., Pogran, K., and D. Henderson,
              "Standard for the format of ARPA network text messages",
              RFC 733, DOI 10.17487/RFC0733, November 1977,
              <http://www.rfc-editor.org/info/rfc733>.

   [RFC0772]  Sluizer, S. and J. Postel, "Mail Transfer Protocol", RFC
              772, DOI 10.17487/RFC0772, September 1980,
              <http://www.rfc-editor.org/info/rfc772>.

   [RFC0882]  Mockapetris, P., "Domain names: Concepts and facilities",
              RFC 882, DOI 10.17487/RFC0882, November 1983,
              <http://www.rfc-editor.org/info/rfc882>.

   [RFC0883]  Mockapetris, P., "Domain names: Implementation
              specification", RFC 883, DOI 10.17487/RFC0883, November
              1983, <http://www.rfc-editor.org/info/rfc883>.

   [RFC1818]  Postel, J., Li, T., and Y. Rekhter, "Best Current
              Practices", RFC 1818, DOI 10.17487/RFC1818, August 1995,
              <http://www.rfc-editor.org/info/rfc1818>.

   [RFC1912]  Barr, D., "Common DNS Operational and Configuration
              Errors", RFC 1912, DOI 10.17487/RFC1912, February 1996,
              <http://www.rfc-editor.org/info/rfc1912>.

   [RFC5598]  Crocker, D., "Internet Mail Architecture", RFC 5598, DOI
              10.17487/RFC5598, July 2009,
              <http://www.rfc-editor.org/info/rfc5598>.

Appendix A.  Test Vectors

   [[NB: This appendix will include a large set of test vectors to test
   matching and validation patterns.]]

Appendix B.  Change Log

   Draft-03 just updates the date to keep the document active.

   The document status is now marked as Informational instead of Best
   Current Practice (although it seems that it could go either way).

   The authors decided to focus on "modern" ASCII-only email identifiers
   first and to get those right before tackling Unicode email
   identifiers and "obsolete" ABNF productions.  Draft-01 preserves the
   main text of draft-00 rather than removing potentially useful text.
   Deliverable and modern email addresses, and modern Message-IDs, have
   been addressed.  The Unicode work remains unfinished for now;
   "obsolete" ABNF productions (which are still useful for archival
   applications) will also be addressed in future drafts.

   The authors decided to write one set of regular expressions in one
   dialect (namely, PCRE/PCRE2) before tackling others (e.g.,
   JavaScript).  Different dialects will be addressed in future drafts.

Authors' Addresses

   Sean Leonard
   Penango, Inc.
   5900 Wilshire Blvd
   Ste 2600
   Los Angeles, CA  90036
   USA

   Email: dev+ietf@seantek.com


   Joe Hildebrand
   Cisco Systems

   Email: jhildebr@cisco.com


   Tony Hansen
   AT&T Laboratories
   200 Laurel Ave South
   Middletown, NJ  07748
   USA

   Email: tony@att.com