draft-ietf-idn-nameprep-04.txt   draft-ietf-idn-nameprep-05.txt 
Internet Draft Paul Hoffman Internet Draft Paul Hoffman
draft-ietf-idn-nameprep-04.txt IMC & VPNC draft-ietf-idn-nameprep-05.txt IMC & VPNC
July 13, 2001 Marc Blanchet July 19, 2001 Marc Blanchet
Expires in six months ViaGenie Expires in six months ViaGenie
Preparation of Internationalized Host Names Preparation of Internationalized Host Names
Status of this memo Status of this memo
This document is an Internet-Draft and is in full conformance with all This document is an Internet-Draft and is in full conformance with all
provisions of Section 10 of RFC2026. provisions of Section 10 of RFC2026.
Internet-Drafts are working documents of the Internet Engineering Task Internet-Drafts are working documents of the Internet Engineering Task
Force (IETF), its areas, and its working groups. Note that other groups Force (IETF), its areas, and its working groups. Note that other groups
may also distribute working documents as Internet-Drafts. may also distribute working documents as Internet-Drafts.
Internet-Drafts are draft documents valid for a maximum of six months Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference material time. It is inappropriate to use Internet-Drafts as reference material
or to cite them other than as "work in progress." or to cite them other than as "work in progress."
The list of current Internet-Drafts can be accessed at To view the list of Internet-Draft Shadow Directories, see
http://www.ietf.org/1id-abstracts.html
The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html. http://www.ietf.org/shadow.html.
Abstract Abstract
This document describes how to prepare internationalized host names for This document describes how to prepare internationalized host names for
use in the DNS. The steps include: use in the DNS. The steps are:
- mapping characters to other characters, such as to change their case - mapping characters to other characters, such as to change their case
- normalizing the characters - normalizing the characters
- excluding characters that are prohibited from appearing in - excluding characters that are prohibited from appearing in
internationalized host names internationalized host names
This document does not specify a wire protocol. This preparation should This document does not specify a wire protocol. This preparation should
be done before the DNS request. be done before the DNS request.
1. Introduction 1. Introduction
When expanding today's DNS to include internationalized host names, When expanding today's DNS to include internationalized host names,
those new names will be handled in many parts of the DNS. The those new names will be handled in many parts of the DNS. The
Internationalized Domain Name (IDN) Working Group's requirements Internationalized Domain Name (IDN) Working Group's requirements
document [IDNReq] describes a framework for domain name handling as well document [IDNReq] describes a framework for domain name handling as well
as requirements for the new names. as requirements for the new names.
A user can enter a domain name into an application program in a myriad A user can enter a domain name into an application program in a myriad
of fashions. Depending on the input method, the characters entered in of fashions. Depending on the input method, the characters entered in
the domain name may or may not be those that are allowed in the domain name may or may not be those that are allowed in
internationalized host names. Thus, there must be a way to normalized internationalized host names. Thus, there must be a way to normalize
the user's input before the name is resolved in the DNS. the user's input before the name is resolved in the DNS.
It is a design goal of this document to allow users to enter host names It is a design goal of this document to allow users to enter host names
in applications and have the highest chance of getting the name correct. in applications and have the highest chance of getting the name correct.
Another, often conflicting, design goal is to allow as wide a range Another, often conflicting, design goal is to allow as wide a range
of characters as possible to be allowed in host names. The user should of characters as possible in host names. The user should not be limited
not be limited to only entering exactly the characters that might have to only entering exactly the characters that might have been used, but
been used, but to instead be able to enter characters that unambiguously to instead be able to enter characters that unambiguously normalize to
normalize to characters in the desired host name. Although it would be characters in the desired host name. Although it would be easy to use
easy to use the process in this step to "correct" perceived mis-features the process in this step to "correct" perceived mis-features or bugs in
or bugs in the current character standards, this document expressly does the current character standards, this document expressly does not do so.
not do so. A difference between a character standard and this specification does
not imply that the character standard is wrong, simply that the
character standard and this specification have different purposes.
This document describes the steps needed to convert a name part from one This document describes the steps needed to convert a name part from one
that is entered by the user to one that can be used in the DNS. that is entered by the user to one that can be used in the DNS.
Within a fully-qualified domain name, some labels may be Within a fully-qualified domain name, some labels may be
internationalized, while others are not. This specification should be internationalized, while others are not. This specification should be
applied to all internationalized labels. An application must be able to applied to all internationalized labels. An application must be able to
recognize which part is internationalized; the method for such recognize which part is internationalized; the method for such
recognition is outside of the scope of this document. Note that this recognition is outside of the scope of this document. Note that this
specification is harmless to the non-internationalized labels: when the specification is harmless to the non-internationalized labels: when the
skipping to change at line 86 skipping to change at line 85
1.1 Terminology 1.1 Terminology
The key words "MUST", "SHALL", "REQUIRED", "SHOULD", "RECOMMENDED", and The key words "MUST", "SHALL", "REQUIRED", "SHOULD", "RECOMMENDED", and
"MAY" in this document are to be interpreted as described in RFC 2119 "MAY" in this document are to be interpreted as described in RFC 2119
[RFC2119]. [RFC2119].
Examples in this document use the notation for code points and names Examples in this document use the notation for code points and names
from the Unicode Standard [Unicode3.1] and ISO/IEC 10646 [ISO10646]. For from the Unicode Standard [Unicode3.1] and ISO/IEC 10646 [ISO10646]. For
example, the letter "a" may be represented as either "U+0061" or "LATIN example, the letter "a" may be represented as either "U+0061" or "LATIN
SMALL LETTER A". In the lists of prohibited characters, the "U+" is left SMALL LETTER A". In the lists of prohibited characters, the "U+" is left
off to make the lists easier to read. The names of character ranges are off to make the lists easier to read. The comments for character ranges
shown in square brackets (such as "[SYMBOLS]") and do not come from the are shown in square brackets (such as "[SYMBOLS]") and do not come from
standards. the standards.
Note: A glossary of terms used in Unicode and ISO/IEC 10646 can be found Note: A glossary of terms used in Unicode and ISO/IEC 10646 can be found
in [Glossary]. Information on the 10646/Unicode character model can be in [Glossary]. Information on the 10646/Unicode character encoding model
found in [CharModel]. can be found in [CharModel].
2. Preparation Overview 2. Preparation Overview
The steps for preparing names are: The steps for preparing names are:
1) Input from the application service interface -- This can be done in 1) Input from the application service interface -- This can be done in
many ways and is not specified in this document many ways and is not specified in this document
2) Map -- For each character in the input, check if it has a mapping 2) Map -- For each character in the input, check if it has a mapping
and, if so, replace it with its mapping. The mappings are a combination and, if so, replace it with its mapping. The mappings are a combination
of folding uppercase characters to lowercase and hyphen mapping. This is of folding uppercase characters to lowercase and mapping out characters.
described in Section 3. This is described in Section 3.
3) Normalize -- Normalize the characters. This is described in Section 3) Normalize -- Normalize the characters. This is described in Section
4. 4.
4) Look for prohibited output -- Check for any characters that are not 4) Look for prohibited output -- Check for any characters that are not
allowed in the output. If any are found, return an error to the allowed in the output. If any are found, return an error to the
application service interface. This is described in Section 5. application service interface. This is described in Section 5.
5) Resolution of the prepared name -- This must be specified in a 5) Resolution of the prepared name -- This must be specified in a
different IDN document. different IDN document.
The above steps MUST be performed in the order given in order to comply The above steps MUST be performed in the order given to comply with this
with this specification. specification.
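The ordering of steps 2 through 4 can be sketched in Python as follows. This is only an illustration under stated assumptions: the two small sets stand in for the real tables in Appendix E and Appendix F, the names prepare, MAPPED_OUT, and PROHIBITED are hypothetical, and Python's built-in case folding and normalization follow a later Unicode version than the one this draft pins.

```python
import unicodedata


class ProhibitedCodePoint(ValueError):
    """Raised when the prepared label contains a prohibited code point."""


# Hypothetical, deliberately tiny stand-ins for the real tables.
MAPPED_OUT = {0x200C, 0x200D}   # two of the code points that are mapped to nothing
PROHIBITED = {0x0020, 0x3002}   # SPACE and IDEOGRAPHIC FULL STOP


def prepare(label: str) -> str:
    # Step 2: map (case folding, then dropping mapped-out characters).
    mapped = "".join(ch for ch in label.casefold() if ord(ch) not in MAPPED_OUT)
    # Step 3: normalize with form KC.
    normalized = unicodedata.normalize("NFKC", mapped)
    # Step 4: look for prohibited output and report an error to the caller.
    for ch in normalized:
        if ord(ch) in PROHIBITED:
            raise ProhibitedCodePoint(f"U+{ord(ch):04X}")
    return normalized
```

Because the prohibition check runs last, it sees the output of mapping and normalization, which is why the order of the steps matters.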
The steps in this document have associated tables in the document. The The steps in this document have associated tables in the document. The
tables are derived from outside sources, and the derivation is briefly tables are derived from outside sources, and the derivation is briefly
described in the document. Although a great deal of effort has gone into described in the document. Although a great deal of effort has gone into
preparing the tables, there is a chance that the tables do not correctly preparing the tables, there is a chance that the tables do not correctly
reflect the outside sources. Regardless of whether or not the tables reflect the outside sources. Regardless of whether or not the tables
differ from the sources, implementations MUST use the tables in this differ from the sources, implementations MUST use the tables in this
document for their processing. That is, if there is an error in the document for their processing. That is, if there is an error in the
tables, the tables must still be used. Future versions of this document tables, the tables must still be used. Future versions of this document
may include corrections and additions to the tables. may include corrections and additions to the tables.
The mappings in section 3 can be one-to-none, one-to-one, or
one-to-many. That is, some characters may be eliminated or replaced by
more than one character, and the output of this step might be shorter or
longer than the input. The normalization in section 4 can be one-to-one
or many-to-one. Because of this, the system using nameprep MUST be
prepared to receive a longer or shorter string than the one input in the
nameprep algorithm.
3. Mapping 3. Mapping
Each character in the input stream is checked against the mapping table. Each character in the input stream is checked against the mapping table.
The mapping table can be found in Appendix E of this document. That The mapping table can be found in Appendix E of this document. That
table includes all the steps described in the subsections below. table includes all the steps described in the subsections below.
Note that the subsections below describe how Appendix E was formed. Note that the subsections below describe how Appendix E was formed.
They are there for people who want to understand more, but they should They are there for people who want to understand more, but they should
be ignored by implementors. Nameprep implementations MUST map based be ignored by implementors. Nameprep implementations MUST map based
on Appendix E, not based on the descriptions in this section of how on Appendix E, not based on the descriptions in this section of how
Appendix E was created. Appendix E was created.
The mappings can be one-to-none, one-to-one, or one-to-many. That is,
some characters may be eliminated or replaced by more than one
character, and the output of this step might be shorter or longer than
the input. Because of this, an application MUST be prepared to receive a
longer or shorter string than the one input in the nameprep algorithm.
Rationale: Characters that are not wanted in internationalized name Rationale: Characters that are not wanted in internationalized name
parts can either be mapped to nothing in the mapping step, or cause an parts are either mapped to nothing in the mapping step, or cause an
error in the prohibition step. The general guideline used to pick error in the prohibition step. The general guideline used to pick
between the two outcomes was that removing alphabetic, non-protocol between the two outcomes was that removing alphabetic, non-protocol
characters be done in the mapping step, but all other removals be done characters be done in the mapping step, but all other characters cause
in the prohibition step. This allows for simple linguistic errors on the errors in the prohibition step. This allows for minor linguistic errors
part of an input mechanism to be caught in the mapping step, but to not on the part of an input mechanism to be caught in the mapping step, but
hide serious errors such as entering protocol characters or invisible to not hide serious errors such as entering protocol characters or
characters from the user. invisible characters from the user.
3.1 Case mapping 3.1 Case mapping
The input string is case folded according to [UTR21]. For most The input string is case folded according to [UTR21]. For most
characters, this is the same thing as changing the input character to a characters, this is the same as changing the input character to a
lowercase character. For some characters, however, more complex lowercase character. For some characters, however, more complex
transformations occur. The "CaseFolding.txt" file from the Unicode transformations occur. The "CaseFolding.txt" file from the Unicode
database was used to prepare Appendix E. database was used to prepare Appendix E.
Rationale: This step could have been "change all lowercase characters Rationale: This could have been "change all lowercase characters
into uppercase characters". However, the upper-to-lower folding was into uppercase characters". However, the upper-to-lower folding was
chosen because most users of the Internet today enter host names in chosen because most users of the Internet today enter host names in
lowercase. lowercase.
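The difference between simple lowercasing and case folding can be seen in a short Python sketch. It assumes that Python's str.casefold(), which follows the Unicode case folding data shipped with the interpreter rather than the table in Appendix E, is close enough for illustration; the function name fold is hypothetical.

```python
def fold(text: str) -> str:
    # str.casefold() applies full Unicode case folding, which can change
    # the length of the string (one-to-many mappings).
    return text.casefold()


# For most characters folding matches lowercasing.
assert fold("EXAMPLE") == "EXAMPLE".lower()

# LATIN SMALL LETTER SHARP S is unchanged by lower() but folds to "ss".
assert "\u00df".lower() == "\u00df"
assert fold("Stra\u00dfe") == "strasse"
```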
3.2 Additional folding mappings 3.2 Additional folding mappings
There are some characters that do not have mappings in [UTR21] but still There are some characters that do not have mappings in [UTR21] but still
need processing. These characters include a few Greek characters and need processing. These characters include a few Greek characters and
many symbols that contain Latin characters. The list of characters to many symbols that contain Latin characters. The list of characters to
add to the mapping table was determined by the following algorithm: add to the mapping table was determined by the following algorithm:
b = NormalizeWithKC(Fold(a)); b = NormalizeWithKC(Fold(a));
c = NormalizeWithKC(Fold(b)); c = NormalizeWithKC(Fold(b));
if c is not the same as b, add a mapping for "a to c". if c is not the same as b, add a mapping for "a to c".
Because NormalizeWithKC(Fold(c)) always equals c, the table is stable Because NormalizeWithKC(Fold(c)) always equals c, the table is stable
from that point on. The "DerivedNormalizationProperties.txt" file from from that point on. The "DerivedNormalizationProperties.txt" file from
the Unicode database was used to prepare Appendix E. the Unicode database was used to prepare Appendix E. This mapping was
added to reduce the number of processing steps, that is, to avoid doing
case mapping and normalization twice.
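The derivation above can be tried out with a small Python sketch. It is an approximation only: fold and normalize_with_kc are hypothetical helpers built on str.casefold() and unicodedata.normalize(), which use the Unicode data bundled with Python rather than the files named above, so the set of characters found may differ from Appendix E.

```python
import unicodedata
from typing import Optional


def fold(text: str) -> str:
    return text.casefold()


def normalize_with_kc(text: str) -> str:
    return unicodedata.normalize("NFKC", text)


def additional_mapping(a: str) -> Optional[str]:
    """Return c when 'a to c' would be added to the table, else None."""
    b = normalize_with_kc(fold(a))
    c = normalize_with_kc(fold(b))
    return c if c != b else None
```

For instance, U+2121 TELEPHONE SIGN has no case folding, but form KC expands it to the uppercase letters "TEL", so a second fold is needed to reach "tel".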
3.3 Mapped out 3.3 Mapped out
The following characters are simply deleted from the input (that is, The following characters are simply deleted from the input (that is,
they are mapped to nothing) because their presence or absence should not they are mapped to nothing) because their presence or absence should not
make two domain names different. make two domain names different.
Some characters are only useful in line-based text, and are otherwise Some characters are only useful in line-based text, and are otherwise
invisible and ignored. invisible and ignored.
skipping to change at line 214 skipping to change at line 217
180D; MONGOLIAN FREE VARIATION SELECTOR THREE 180D; MONGOLIAN FREE VARIATION SELECTOR THREE
200C; ZERO WIDTH NON-JOINER 200C; ZERO WIDTH NON-JOINER
200D; ZERO WIDTH JOINER 200D; ZERO WIDTH JOINER
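Mapping a character to nothing amounts to deleting it from the string. The Python sketch below uses only the three code points visible in this part of the list; the name MAPPED_OUT_SAMPLE is hypothetical, and a real implementation would take the full set from Appendix E.

```python
# Partial, illustrative set: only the code points shown in the list above.
MAPPED_OUT_SAMPLE = {0x180D, 0x200C, 0x200D}


def map_out(text: str) -> str:
    # Drop every character whose code point is mapped to nothing.
    return "".join(ch for ch in text if ord(ch) not in MAPPED_OUT_SAMPLE)
```

Two inputs that differ only in these characters produce the same output, which is the property this subsection asks for.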
4. Normalization 4. Normalization
The output of the mapping step is normalized using form KC, as described The output of the mapping step is normalized using form KC, as described
in [UAX15]. Using form KC instead of form C causes many characters that in [UAX15]. Using form KC instead of form C causes many characters that
are identical or near-identical to be converted into a single character. are identical or near-identical to be converted into a single character.
Note that this specification refers to a specific version of [UAX15]. If Note that this specification refers to a specific version of [UAX15]. If
a later version of [UAX15] changes the algorithm used for normalizing, a later version of [UAX15] changes the algorithm used for normalization,
that later version MUST NOT be used with this specification. Note that that later version MUST NOT be used with this specification. Note that
it is likely that this specification will be revised if UAX15 is it is likely that this specification will be revised if UAX15 is
changed, but until that happens, only the specified version of [UAX15] changed, but until that happens, systems compliant with this
must be used. specification MUST use only the specified version of [UAX15].
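The effect of form KC compared with form C can be seen with Python's unicodedata module. This is a sketch under one important assumption: the module implements normalization for whatever Unicode version the interpreter ships (reported by unicodedata.unidata_version), which is not the specific version of [UAX15] that this specification requires.

```python
import unicodedata


def normalize_kc(text: str) -> str:
    return unicodedata.normalize("NFKC", text)


# Both forms compose "e" + COMBINING ACUTE ACCENT into U+00E9.
assert unicodedata.normalize("NFC", "e\u0301") == "\u00e9"
assert normalize_kc("e\u0301") == "\u00e9"

# Only form KC replaces compatibility characters: the "fi" ligature and
# fullwidth Latin letters become plain ASCII letters.
assert unicodedata.normalize("NFC", "\ufb01") == "\ufb01"
assert normalize_kc("\ufb01") == "fi"
assert normalize_kc("\uff45\uff58") == "ex"

# The Unicode version in use should be checked before relying on the result.
print(unicodedata.unidata_version)
```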
5. Prohibited Output 5. Prohibited Output
Before the text can be emitted, it must be checked for prohibited code Before the text can be emitted, it must be checked for prohibited code
points. There is a variety of prohibited code points, as described in points. There is a variety of prohibited code points, as described in
this section. this section.
Note that the subsections below describe how Appendix F was formed. Note that the subsections below describe how Appendix F was formed.
They are there for people who want to understand more, but they should They are there for people who want to understand more, but they should
be ignored by implementors. Nameprep implementations MUST map based be ignored by implementors. Nameprep implementations MUST map based
skipping to change at line 243 skipping to change at line 246
names as long as those host names do not cause other problems, such as names as long as those host names do not cause other problems, such as
conflict with other standards. Specifically, experience with current DNS conflict with other standards. Specifically, experience with current DNS
names has shown that there is a desire for host names that include names has shown that there is a desire for host names that include
personal names, company names, and spoken phrases. A goal of this personal names, company names, and spoken phrases. A goal of this
section is to prohibit as few characters that might be used in these section is to prohibit as few characters that might be used in these
contexts as possible. contexts as possible.
The collected list of prohibited code points can be found in Appendix F The collected list of prohibited code points can be found in Appendix F
of this document. The list in Appendix F MUST be used by implementations of this document. The list in Appendix F MUST be used by implementations
of this specification. If there are any discrepancies between the list of this specification. If there are any discrepancies between the list
in Appendix F and subsections below, the list Appendix F always takes in Appendix F and subsections below, the list in Appendix F always takes
precedence. precedence.
Some code points listed in one section would also appear in other Some code points listed in one section would also appear in other
sections. Each code point is only listed once in the table in Appendix sections. Each code point is only listed once in the table in Appendix
F. F.
5.1 Currently-prohibited ASCII characters 5.1 Currently-prohibited ASCII characters
Some of the ASCII characters that are currently prohibited in host names Some of the ASCII characters that are currently prohibited in host names
by [STD13] are also used in protocol elements such as URIs [URI]. The other by [STD13] are also used in protocol elements such as URIs [URI]. The other
characters in the range U+0000 to U+007F that are not currently allowed characters in the range U+0000 to U+007F that are not currently allowed
are also prohibited in host name parts to reserve them for future use in are also prohibited in host name parts to reserve them for future use in
protocol elements. protocol elements.
0000-002C; [ASCII] 0000-002C; [ASCII CONTROL CHARACTERS and SPACE through ,]
002E-002F; [ASCII] 002E-002F; [ASCII . through /]
003A-0040; [ASCII] 003A-0040; [ASCII : through @]
005B-0060; [ASCII] 005B-0060; [ASCII [ through `]
007B-007F; [ASCII] 007B-007F; [ASCII { through DEL]
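A range list such as the one above can be checked with a simple scan. The following Python sketch copies the five ASCII ranges from this subsection; the names PROHIBITED_ASCII and first_prohibited are hypothetical, and a conforming implementation would use the collected list in Appendix F instead.

```python
from typing import Optional

# Inclusive (low, high) code point ranges copied from the list above.
PROHIBITED_ASCII = [
    (0x0000, 0x002C),
    (0x002E, 0x002F),
    (0x003A, 0x0040),
    (0x005B, 0x0060),
    (0x007B, 0x007F),
]


def first_prohibited(text: str, ranges=PROHIBITED_ASCII) -> Optional[int]:
    """Return the first prohibited code point in text, or None."""
    for ch in text:
        cp = ord(ch)
        if any(low <= cp <= high for low, high in ranges):
            return cp
    return None
```

The gaps between the ranges leave exactly the hyphen, the digits, and the ASCII letters, so "host-1" passes while "a_b" is rejected at U+005F.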
5.2 Space characters 5.2 Space characters
Space characters would make visual transcription of URLs nearly Space characters would make visual transcription of URLs nearly
impossible and could lead to user entry errors in many ways. impossible and could lead to user entry errors in many ways.
0020; SPACE 0020; SPACE
00A0; NO-BREAK SPACE 00A0; NO-BREAK SPACE
1680; OGHAM SPACE MARK 1680; OGHAM SPACE MARK
2000; EN QUAD 2000; EN QUAD
skipping to change at line 288 skipping to change at line 291
2006; SIX-PER-EM SPACE 2006; SIX-PER-EM SPACE
2007; FIGURE SPACE 2007; FIGURE SPACE
2008; PUNCTUATION SPACE 2008; PUNCTUATION SPACE
2009; THIN SPACE 2009; THIN SPACE
200A; HAIR SPACE 200A; HAIR SPACE
202F; NARROW NO-BREAK SPACE 202F; NARROW NO-BREAK SPACE
3000; IDEOGRAPHIC SPACE 3000; IDEOGRAPHIC SPACE
5.3 Control characters 5.3 Control characters
Control characters cannot be seen and can cause unpredictable results Control characters (or characters with control function) cannot be seen
when displayed. and can cause unpredictable results when displayed.
0000-001F; [CONTROL CHARACTERS] 0000-001F; [CONTROL CHARACTERS]
007F; DELETE 007F; DELETE
0080-009F; [CONTROL CHARACTERS] 0080-009F; [CONTROL CHARACTERS]
070F; SYRIAC ABBREVIATION MARK
180E; MONGOLIAN VOWEL SEPARATOR
2028; LINE SEPARATOR 2028; LINE SEPARATOR
2029; PARAGRAPH SEPARATOR 2029; PARAGRAPH SEPARATOR
206A-206F; [CONTROL CHARACTERS]
The following characters have been reserved for future use as control FFF9-FFFC; [CONTROL CHARACTERS]
characters, and are therefore prohibited now even though some of them 1D173-1D17A; [MUSICAL CONTROL CHARACTERS]
are not yet encoded.
2060-206F; [CONTROL CHARACTERS]
FFF0-FFFC; [CONTROL CHARACTERS]
E0000-E0FFF; [CONTROL CHARACTERS]
5.4 Private use and replacement characters 5.4 Private use and replacement characters
Because private-use characters do not have defined meanings, they are Because private-use characters do not have defined meanings, they are
prohibited. The private-use characters are: prohibited. The private-use characters are:
E000-F8FF; [PRIVATE USE, PLANE 0] E000-F8FF; [PRIVATE USE, PLANE 0]
F0000-FFFFD; [PRIVATE USE, PLANE 15] F0000-FFFFD; [PRIVATE USE, PLANE 15]
100000-10FFFD; [PRIVATE USE, PLANE 16] 100000-10FFFD; [PRIVATE USE, PLANE 16]
The replacement character (U+FFFD) has no known semantic definition in a The replacement character (U+FFFD) has no known semantic definition in a
name, and is often displayed by renderers to indicate "there would be some name, and is often displayed by renderers to indicate "there would be
character here, but it cannot be rendered". For example, on a computer some character here, but it cannot be rendered". For example, on a
with no Asian fonts, a name with three katakana characters might be computer with no Asian fonts, a name with three ideographs might be
rendered with three replacement characters. rendered with three replacement characters.
FFFD; REPLACEMENT CHARACTER FFFD; REPLACEMENT CHARACTER
5.5 Non-character code points 5.5 Non-character code points
Non-character code points are code points that have been assigned in Non-character code points are code points that have been allocated in
ISO/IEC 10646 but are not characters. Because they are already assigned, ISO/IEC 10646 but are not characters. Because they are already assigned,
they are guaranteed not to later change into characters. they are guaranteed not to later change into characters.
FDD0-FDEF; [NONCHARACTER CODE POINTS] FDD0-FDEF; [NONCHARACTER CODE POINTS]
FFFE-FFFF; [NONCHARACTER CODE POINTS] FFFE-FFFF; [NONCHARACTER CODE POINTS]
1FFFE-1FFFF; [NONCHARACTER CODE POINTS] 1FFFE-1FFFF; [NONCHARACTER CODE POINTS]
2FFFE-2FFFF; [NONCHARACTER CODE POINTS] 2FFFE-2FFFF; [NONCHARACTER CODE POINTS]
3FFFE-3FFFF; [NONCHARACTER CODE POINTS] 3FFFE-3FFFF; [NONCHARACTER CODE POINTS]
4FFFE-4FFFF; [NONCHARACTER CODE POINTS] 4FFFE-4FFFF; [NONCHARACTER CODE POINTS]
5FFFE-5FFFF; [NONCHARACTER CODE POINTS] 5FFFE-5FFFF; [NONCHARACTER CODE POINTS]
skipping to change at line 347 skipping to change at line 347
8FFFE-8FFFF; [NONCHARACTER CODE POINTS] 8FFFE-8FFFF; [NONCHARACTER CODE POINTS]
9FFFE-9FFFF; [NONCHARACTER CODE POINTS] 9FFFE-9FFFF; [NONCHARACTER CODE POINTS]
AFFFE-AFFFF; [NONCHARACTER CODE POINTS] AFFFE-AFFFF; [NONCHARACTER CODE POINTS]
BFFFE-BFFFF; [NONCHARACTER CODE POINTS] BFFFE-BFFFF; [NONCHARACTER CODE POINTS]
CFFFE-CFFFF; [NONCHARACTER CODE POINTS] CFFFE-CFFFF; [NONCHARACTER CODE POINTS]
DFFFE-DFFFF; [NONCHARACTER CODE POINTS] DFFFE-DFFFF; [NONCHARACTER CODE POINTS]
EFFFE-EFFFF; [NONCHARACTER CODE POINTS] EFFFE-EFFFF; [NONCHARACTER CODE POINTS]
FFFFE-FFFFF; [NONCHARACTER CODE POINTS] FFFFE-FFFFF; [NONCHARACTER CODE POINTS]
10FFFE-10FFFF; [NONCHARACTER CODE POINTS] 10FFFE-10FFFF; [NONCHARACTER CODE POINTS]
The non-character code points are listed in the PropList.txt file from the
Unicode database.
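The non-character ranges follow a regular pattern: the block FDD0-FDEF, plus the last two code points of each of the 17 planes. A Python sketch of that pattern is shown below; the function name is_noncharacter is hypothetical, and the listed table remains the authoritative form.

```python
def is_noncharacter(cp: int) -> bool:
    # FDD0-FDEF, or a code point ending in FFFE or FFFF in any plane
    # from 0 through 16.
    if 0xFDD0 <= cp <= 0xFDEF:
        return True
    return 0 <= cp <= 0x10FFFF and (cp & 0xFFFE) == 0xFFFE
```

Counting over the whole code space gives 32 + 2 * 17 = 66 non-character code points, which can serve as a quick consistency check on a hand-built table.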
5.6 Surrogate codes 5.6 Surrogate codes
The following code points are permanently reserved for use as surrogate The following code points are permanently reserved for use as surrogate
code values in the UTF-16 encoding, will never be assigned to code values in the UTF-16 encoding, will never be assigned to
characters, and are therefore prohibited: characters, and are therefore prohibited:
D800-DFFF; [SURROGATE CODES] D800-DFFF; [SURROGATE CODES]
5.7 Inappropriate for plain text 5.7 Inappropriate for plain text
skipping to change at line 370 skipping to change at line 373
FFFA; INTERLINEAR ANNOTATION SEPARATOR FFFA; INTERLINEAR ANNOTATION SEPARATOR
FFFB; INTERLINEAR ANNOTATION TERMINATOR FFFB; INTERLINEAR ANNOTATION TERMINATOR
FFFC; OBJECT REPLACEMENT CHARACTER FFFC; OBJECT REPLACEMENT CHARACTER
5.8 Inappropriate for domain names 5.8 Inappropriate for domain names
The ideographic description characters allow different sequences of The ideographic description characters allow different sequences of
characters to be rendered the same way, which makes them inappropriate characters to be rendered the same way, which makes them inappropriate
for host names that must have a single canonical representation. for host names that must have a single canonical representation.
2FF0-2FFF; [IDEOGRAPHIC DESCRIPTION CHARACTERS] 2FF0-2FFB; [IDEOGRAPHIC DESCRIPTION CHARACTERS]
5.9 Change display properties 5.9 Change display properties
The following characters, some of which are deprecated in ISO/IEC 10646, The following characters, some of which are deprecated in ISO/IEC 10646,
can cause changes in display or the order in which characters appear can cause changes in display or the order in which characters appear
when rendered. when rendered.
200E; LEFT-TO-RIGHT MARK 200E; LEFT-TO-RIGHT MARK
200F; RIGHT-TO-LEFT MARK 200F; RIGHT-TO-LEFT MARK
202A; LEFT-TO-RIGHT EMBEDDING 202A; LEFT-TO-RIGHT EMBEDDING
skipping to change at line 405 skipping to change at line 408
particularly in Asia. This prohibition allows input mechanisms to safely particularly in Asia. This prohibition allows input mechanisms to safely
map U+3002 to U+002E before doing nameprep without worrying about map U+3002 to U+002E before doing nameprep without worrying about
preventing users from accessing legitimate host name parts. preventing users from accessing legitimate host name parts.
3002; IDEOGRAPHIC FULL STOP 3002; IDEOGRAPHIC FULL STOP
5.11 Tagging characters 5.11 Tagging characters
The following characters are used for tagging text and are invisible. The following characters are used for tagging text and are invisible.
E0000-E007F; [TAGGING CHARACTERS] E0001; LANGUAGE TAG
E0020-E007F; [TAGGING CHARACTERS]
6. Unassigned Code Points 6. Unassigned Code Points
All code points not assigned in [Unicode3.1] are called "unassigned code All code points not assigned in [Unicode3.1] are called "unassigned code
points". Authoritative name servers MUST NOT have internationalized name points". Authoritative name servers MUST NOT have internationalized name
parts that contain any unassigned code points. DNS requests MAY contain parts that contain any unassigned code points. DNS requests MAY contain
name parts that contain unassigned code points. Note that this is the name parts that contain unassigned code points. Note that this is the
only part of this document where the requirements for queries differ only part of this document where the requirements for queries differ
from the requirements for names in DNS zones. from the requirements for names in DNS zones.
Note: For this section, Unicode 3.1 is the base repertoire of unassigned Note: For this section, Unicode 3.1 is the base repertoire of unassigned
code points. The reason Unicode 3.1 was chosen instead of a version of code points. The reason Unicode 3.1 was chosen instead of a version of
ISO/IEC 10646 is that ISO/IEC 10646 is expected to be updated soon after ISO/IEC 10646 is that ISO/IEC 10646 is expected to be updated soon after
this document becomes an RFC. Unicode 3.1 has the exact repertoire that this document becomes an RFC. Unicode 3.1 has the exact repertoire that
is expected in the next version of ISO/IEC 10646, and is therefore used is expected in the next version of ISO/IEC 10646, and is therefore used
here. here.
Using two different policies for where unassigned code points can appear Using two different policies for where unassigned code points can appear
in the DNS prevents the need for versioning the IDN protocol [IDNrev]. in the DNS prevents the need for versioning the IDN protocol [IDNrev].
This is very useful since it makes the overall processing simpler and do This is very useful since it makes the overall processing simpler and
not impose a "protocol" to handle versioning. It is expected that does not impose a "protocol" to handle versioning. It is expected that
ISO/IEC 10646 will be updated fairly frequently; recently, ISO/IEC 10646 will be updated fairly frequently; recently, it has
it has happened approximately once a year. Each time a new version of happened approximately once a year. Each time a new version of ISO/IEC
ISO/IEC 10646 appears, a new version of this document can be 10646 appears, a new version of this document can be created. Some end
created. Some end users will want to use the new code points as soon as users will want to use the new code points as soon as they are defined.
they are defined.
The list of unassigned code points can be found in Appendix G of this The list of unassigned code points can be found in Appendix G of this
document. The list in Appendix G MUST be used by implementations of this document. The list in Appendix G MUST be used by implementations of this
specification. If there are any discrepancies between the list in specification. If there are any discrepancies between the list in
Appendix G and the Unicode 3.1 specification, the list in Appendix G Appendix G and the Unicode 3.1 specification, the list in Appendix G
always takes precedence. always takes precedence.
Due to the way that versioning is handled in this section, host names Due to the way that versioning is handled in this section, host names
that are embedded in structures that cannot be changed (such as the that are embedded in structures that cannot be changed (such as the
signed parts of digital certificates) MUST NOT have internationalized signed parts of digital certificates) MUST NOT have internationalized
name parts that contain any unassigned code points. name parts that contain any unassigned code points.
6.1 Categories of code points 6.1 Categories of code points
Each code point in ISO/IEC 10646 can be categorized by how it acts in the Each code point in ISO/IEC 10646 can be categorized by how it acts in the
process described in earlier sections of this document: process described in earlier sections of this document:
AO Code points that may be in the output AO Code points that may be in the output
MN Code points that cannot be in the output because they are MN Code points that cannot be in the output because they are
mapped to nothing or never appear as output from never appear as output from mapping or normalization
normalization
D Code points that cannot be in the output because they are D Code points that cannot be in the output because they are
disallowed in the prohibition step disallowed in the prohibition step
U Unassigned code points U Unassigned code points
A subsequent version of this document that references a newer version of A subsequent version of this document that references a newer version of
ISO/IEC 10646 with new code points will inherently have some code points ISO/IEC 10646 with new code points will inherently have some code points
move from category U to either D, MN, or AO. For backwards move from category U to either D, MN, or AO. For backwards
compatibility, no future version of this document will move code points compatibility, no future version of this document will move code points
from any other category. That is, no current AO, MN, or D code points from any other category. That is, no current AO, MN, or D code points
will ever change to a different category. will ever change to a different category.
Authoritative name servers MUST NOT contain any name that has code Authoritative name servers MUST NOT contain any name that has code
points outside of AO for the latest version of this document. That is, points outside of AO for the latest version of this document. That is,
they are forbidden to contain any IDN names containing code points from they are forbidden to contain any IDN names containing code points from
the MN, D, or U categories. the MN, D, or U categories.
Applications creating name queries MUST treat U code points as if they Applications creating name queries MUST treat U code points as if they
were AO when preparing the name parts according to this document. Those were AO when preparing the name parts according to this document. Those
applications MAY optionally have a preprocess that provides stricter applications MAY optionally have a preprocessor that provides stricter
checks: treating unassigned code points in the input as errors, or checks: treating unassigned code points in the input as errors, or
warning the user about the fact that the code point is unassigned in the warning the user about the fact that the code point is unassigned in the
version of this document that the software is based on; such a choice is version of this document that the software is based on; such a choice is
a local matter for the software. a local matter for the software.
Non-authoritative DNS servers MAY reject names that contain code points Non-authoritative DNS servers MAY reject queries that include name parts
that are in categories MN or D for the version of this document that containing code points that are in categories MN or D for the version of
they implement, but MUST NOT reject names because they contain name this document that they implement, but MUST NOT reject queries that
parts with code points from category U. include name parts only for the reason that those parts contain code
points from category U.
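The rules in section 6.1 can be sketched roughly as follows. This is only an illustration, not part of either draft: the names Category, TOY_TABLE, category_of, allowed_in_zone, and may_reject_query are hypothetical, and the four-entry table stands in for the full data that an implementation would take from the mapping steps and from Appendices F and G.

```python
from enum import Enum


class Category(Enum):
    AO = "may appear in the output"
    MN = "never appears as output from mapping or normalization"
    D = "disallowed in the prohibition step"
    U = "unassigned"


# Hypothetical four-entry table, for illustration only.
TOY_TABLE = {
    0x0061: Category.AO,   # 'a'
    0x200B: Category.MN,   # mapped out, so it never reaches the output
    0x0020: Category.D,    # inside the prohibited range 0000-002C
    0xE0000: Category.U,   # listed in the unassigned table
}


def category_of(ch):
    """Look up the category of a single character in the toy table."""
    return TOY_TABLE[ord(ch)]


def allowed_in_zone(name_part):
    """Authoritative data: every code point must be in category AO."""
    return all(category_of(ch) is Category.AO for ch in name_part)


def may_reject_query(name_part):
    """Non-authoritative server: rejection is permitted only for MN or D.

    Category U alone is never a reason to reject a query.
    """
    return any(category_of(ch) in (Category.MN, Category.D) for ch in name_part)
```

With this table, a name part made of "a" followed by U+E0000 is not allowed in a zone, yet a non-authoritative server has no grounds to reject a query for it.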
6.2 Reasons for difference between authoritative servers and requests 6.2 Reasons for difference between authoritative servers and requests
Different software using different versions of this document need to Different software using different versions of this document need to
interoperate with maximal compatibility. The scheme described in this interoperate with maximal compatibility. The scheme described in this
section (authoritative name servers MUST NOT use unassigned code points, section (authoritative name servers MUST NOT use unassigned code points,
requests MAY include unassigned code points) allows that compatibility requests MAY include unassigned code points) allows that compatibility
without introducing any known security or interoperability issues. without introducing any known security or interoperability issues.
The list below shows what happens if a request contains a code point The list below shows what happens if a request contains a code point
from category U that is allowed in a newer version of this document. The from category U that is allowed in a newer version of this document. The
request either resolves to the domain name that was intended, or request either resolves to the domain name that was intended, or
resolves to no domain at all. In this list, the request comes from an resolves to no domain at all. In this list, the request comes from an
application using version "oldVersion" of this document, the application using version "oldVersion" of this document, the
authoritative name server is using version "newVersion" of this authoritative name server is using version "newVersion" of this
document, and the code point X was in category U on oldVersion, and has document, and the code point X was in category U on oldVersion, and has
changed category to AO, MN, or D. There are 3 possible scenarios: changed category to AO, MN, or D. There are 3 possible scenarios:
1. X becomes AO -- In newVersion, X is in category AO. Because the 1. X is assigned to AO -- In newVersion, X is in category AO. Because
application passed X through, it gets back correct data from the the application passed X through, it gets back correct data from the
authoritative name server. There is one exceptional case, where X is a authoritative name server. There is one exceptional case, where X is a
combining mark. combining mark.
The order of combining marks is normalized, so if another combining mark The order of combining marks is normalized, so if another combining mark
Y has a lower combining class than X then XY will be put in the Y has a lower combining class than X then XY will be put in the
canonical order YX. (Unassigned code points are never reordered, so this canonical order YX. (Unassigned code points are never reordered, so this
doesn't happen in oldVersion). If the request contains YX, the request doesn't happen in oldVersion). If the request contains YX, the request
will get correct data from the authoritative name server. However, no will get correct data from the authoritative name server. However, no
domain name can be registered with XY, so a request with XY will get a domain name can be registered with XY, so a request with XY will get a
"no such host" error. "no such host" error.
2. X becomes MN -- In newVersion, X is normalized to code point "nX" and 2. X is assigned to MN -- In newVersion, X is normalized to code point
therefore X is now put in category MN. This cannot exist in any domain "nX" and therefore X is now put in category MN. This cannot exist in any
name, so any request containing X will get back a "no such host" error. domain name, so any request containing X will get back a "no such host"
Note, however, if the request had contained the letter nX, it would have error. Note, however, if the request had contained the letter nX, it
gotten back correct data. would have gotten back correct data.
3. X becomes D -- In newVersion, X is in category MN. This cannot exist 3. X is assigned to D -- In newVersion, X is in category D. This cannot
in any domain name, so any request containing X will get back a "no such exist in any domain name, so any request containing X will get back a
host" error. "no such host" error.
In none of the cases does the request get data for a host name other In none of the cases does the request get data for a host name other
than the one it actually wanted. than the one it actually wanted.
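The combining-mark exception in scenario 1 depends on canonical ordering. The sketch below shows that reordering with two marks that are already assigned, standing in for X and Y; it assumes Python's unicodedata module (which follows a later Unicode version than 3.1) and uses NFD only to make the ordering visible. The helper name canonical_order is hypothetical.

```python
import unicodedata

X = "\u0301"  # COMBINING ACUTE ACCENT, stands in for the newly assigned mark X
Y = "\u0323"  # COMBINING DOT BELOW, stands in for a mark Y with a lower class


def canonical_order(text):
    """Decompose and put combining marks into canonical order (NFD)."""
    return unicodedata.normalize("NFD", text)


xy = "a" + X + Y  # the order an oldVersion application would pass through
yx = "a" + Y + X  # the canonical order, the only form a zone can hold

# Y has the lower combining class, so it sorts before X.
assert unicodedata.combining(Y) < unicodedata.combining(X)
# Software that knows both marks rewrites XY as YX.
assert canonical_order(xy) == yx
# YX is already in canonical order and stays unchanged.
assert canonical_order(yx) == yx
```

An oldVersion application that treats X as unassigned would leave XY untouched, which is why a request with XY finds no matching name while a request with YX does.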
The processing in this document is always stable. If a string S is the The processing in this document is always stable. If a string S is the
result of processing on newVersion, then it will remain the same when result of processing on newVersion, then it will remain the same when
processed on oldVersion. processed on oldVersion.
There is always a way for the application to get the correct data from There is always a way for the application to get the correct data from
the authoritative name server. For example, suppose that <ALPHA> was the authoritative name server. For example, suppose that <ALPHA> was
skipping to change at line 548 skipping to change at line 551
running newVersion can pass a processed host name to the application running newVersion can pass a processed host name to the application
running oldVersion. It will only contain <alpha>, and will return the running oldVersion. It will only contain <alpha>, and will return the
correct results from the authoritative name server. correct results from the authoritative name server.
6.3 Versions of applications and authoritative name servers 6.3 Versions of applications and authoritative name servers
Another way to see that this versioning system works is to compare what Another way to see that this versioning system works is to compare what
happens when an application uses a newer or older version of this happens when an application uses a newer or older version of this
document. document.
Newer application -- Suppose that a application or intermediary DNS Newer application -- Suppose that an application or intermediary DNS
server is using version newVersion and the authoritative name server is server is using version newVersion and the authoritative name server is
using version oldVersion. This case is simple: there will be no names on using version oldVersion. This case is simple: there will be no names on
the server that cannot be accessed by the application because the the server that cannot be accessed by the application because the
resolver uses a superset of the code points accepted by the server. resolver uses a superset of the code points accepted by the server.
Newer server -- Suppose that an application or intermediary DNS server Newer server -- Suppose that an application or intermediary DNS server
is using oldVersion and the authoritative name server is using is using oldVersion and the authoritative name server is using
newVersion. Because the application passed through any unassigned code newVersion. Because the application passed through any unassigned code
points, the user can access names on the server that use code points in points, the user can access names on the server that use code points in
newVersion. No names on the site can have code points that are newVersion. No names on the site can have code points that are
skipping to change at line 584 skipping to change at line 587
Current applications may assume that the characters allowed in host Current applications may assume that the characters allowed in host
names will always be the same as they are in [STD13]. This document names will always be the same as they are in [STD13]. This document
vastly increases the number of characters available in host names. Every vastly increases the number of characters available in host names. Every
program that uses "special" characters in conjunction with host names program that uses "special" characters in conjunction with host names
may be vulnerable to attack based on the new characters allowed by this may be vulnerable to attack based on the new characters allowed by this
specification. specification.
8. References 8. References
[CharModel] Unicode Technical Report;17, Character Model. [CharModel] Unicode Technical Report;17, Character Encoding Model.
<http://www.unicode.org/unicode/reports/tr17/>. <http://www.unicode.org/unicode/reports/tr17/>.
[Glossary] Unicode Glossary, <http://www.unicode.org/glossary/>. [Glossary] Unicode Glossary, <http://www.unicode.org/glossary/>.
[IDNReq] Zita Wenzel and James Seng, "Requirements of Internationalized [IDNReq] Zita Wenzel and James Seng, "Requirements of Internationalized
Domain Names", draft-ietf-idn-requirements Domain Names", draft-ietf-idn-requirements
[IDNRev] Marc Blanchet, "Handling versions of internationalized domain [IDNRev] Marc Blanchet, "Handling versions of internationalized domain
names protocols", draft-ietf-idn-version names protocols", draft-ietf-idn-version
skipping to change at line 621 skipping to change at line 624
[STD13] Paul Mockapetris, "Domain names - concepts and facilities" (RFC [STD13] Paul Mockapetris, "Domain names - concepts and facilities" (RFC
1034) and "Domain names - implementation and specification" (RFC 1035, 1034) and "Domain names - implementation and specification" (RFC 1035,
STD 13, November 1987. STD 13, November 1987.
[Unicode3.1] The Unicode Standard, Version 3.1.0: The Unicode [Unicode3.1] The Unicode Standard, Version 3.1.0: The Unicode
Consortium. The Unicode Standard, Version 3.0. Reading, MA, Consortium. The Unicode Standard, Version 3.0. Reading, MA,
Addison-Wesley Developers Press, 2000. ISBN 0-201-61633-5, as amended Addison-Wesley Developers Press, 2000. ISBN 0-201-61633-5, as amended
by: Unicode Standard Annex #27: Unicode 3.1 by: Unicode Standard Annex #27: Unicode 3.1
<http://www.unicode.org/unicode/reports/tr27/tr27-4.html>. <http://www.unicode.org/unicode/reports/tr27/tr27-4.html>.
[URIs] For example: Roy Fielding et. al., "Uniform Resource Identifiers: [URI] For example: Roy Fielding et al., "Uniform Resource Identifiers:
Generic Syntax", August 1998, RFC 2396; Robert Hinden et. al, "IPv6 Generic Syntax", August 1998, RFC 2396; Robert Hinden et. al, "IPv6
Literal Addresses in URL's", December 1999, RFC 2732. Literal Addresses in URL's", December 1999, RFC 2732. Note that
there are many other RFCs that define additional URI schemes.
[UAX15] Mark Davis and Martin Duerst. Unicode Standard Annex #15: [UAX15] Mark Davis and Martin Duerst. Unicode Standard Annex #15:
Unicode Normalization Forms, Version 3.1.0. Unicode Normalization Forms, Version 3.1.0.
<http://www.unicode.org/unicode/reports/tr15/tr15-21.html> <http://www.unicode.org/unicode/reports/tr15/tr15-21.html>
[UTR21] Mark Davis. Case Mappings. Unicode Technical Report;21. [UTR21] Mark Davis. Case Mappings. Unicode Technical Report;21.
<http://www.unicode.org/unicode/reports/tr21/>. <http://www.unicode.org/unicode/reports/tr21/>.
A. Acknowledgements A. Acknowledgements
skipping to change at line 658 skipping to change at line 662
Martin Duerst Martin Duerst
Patrik Faltstrom Patrik Faltstrom
Paul Hoffman Paul Hoffman
Additional significant improvements were proposed by: Additional significant improvements were proposed by:
Jonathan Rosenne Jonathan Rosenne
Kent Karlsson Kent Karlsson
Scott Hollenbeck Scott Hollenbeck
B. Differences Between -03 and -04 Drafts B. Differences Between -04 and -05 Drafts
Throughout: updated references from Unicode 3.0 to Unicode 3.1.
3: Added the second paragraph explaining the purpose of the explanations
in the section.
3.1: Changed the first paragraph to describe the use of the Small editorial changes throughout.
"CaseFolding.txt" file.
3.2: Added the description of the use of the 1: Added sentence at end of third paragraph.
DerivedNormalizationProperties.txt file to the end of the section.
4: Changed the references from "[UTR15]" to "[UAX15]". 2: Took paragraph from section 3 and moved it to the end of section 2
with a few changes.
5: Added the second paragraph explaining the purpose of the explanations 3.2: Added sentence at end of last paragraph.
in the section.
5.2: Sorted the list. Removed 200B from prohibited list because it is 4: Clarified last sentence.
already mapped out in section 3.3; this causes no change to the list of
characters allowed in IDN name parts.
5.3: Added three ranges that are reserved for future control character 5.1: Added detail to ASCII ranges.
use.
5.5: Added FDD0-FDEF to the list. 5.3: Added "(or characters with control function)"
Added:
070F
180E
206A-206F
FFF9-FFFC
1D173-1D17A
Removed the "future" control characters.
5.8: Changed "order" to "representation" in the last sentence. 5.5: Added note about properties list at the end of the section.
5.11: Added this section of prohibited characters. 5.8: Narrowed the range to 2FF0-2FFB because that is all that is
currently assigned.
6: Changed this section to point to Unicode 3.1 instead of ISO/IEC 10646 5.11: Changed the range to E0001 and E0020-E007F.
due to timing reasons for the repertoire.
8: Changed the reference for [UAX15] to a specific version. Changed the 6.1: Clarified definition of MN. Clarified last paragraph.
reference for [Unicode3] to [Unicode3.1] and changed the title and URL
for this specific version.
E: Updated the table for changes in section 3, which reflects the 6.2, step 3: Corrected first sentence to say "D" instead of "MN".
changes in Unicode 3.1.
F: Removed 200B from prohibited table because it is already mapped out 8: Fixed title of [CharModel]. Added sentence at the end of [URI].
in section 3.3; this causes no change to the list of characters allowed
in IDN name parts. Added new characters that were added in this
revision.
G: Revised the table based on date from Unicode 3.1. F: Updated the table from the changes in 5 above.
C. IANA Considerations C. IANA Considerations
None. None.
D. Author Contact Information D. Author Contact Information
Paul Hoffman Paul Hoffman
Internet Mail Consortium and VPN Consortium Internet Mail Consortium and VPN Consortium
127 Segre Place 127 Segre Place
skipping to change at line 2112 skipping to change at line 2107
F. Prohibited Code Point List F. Prohibited Code Point List
----- Start Prohibited Table ----- ----- Start Prohibited Table -----
0000-002C 0000-002C
002E-002F 002E-002F
003A-0040 003A-0040
005B-0060 005B-0060
007B-007F 007B-007F
0080-009F 0080-009F
00A0 00A0
070F
1680 1680
180E
2000 2000
2001 2001
2002 2002
2003 2003
2004 2004
2005 2005
2006 2006
2007 2007
2008 2008
2009 2009
skipping to change at line 2134 skipping to change at line 2131
200E 200E
200F 200F
2028 2028
2029 2029
202A 202A
202B 202B
202C 202C
202D 202D
202E 202E
202F 202F
2060-206F 206A-206F
2FF0-2FFF 2FF0-2FFB
3000 3000
3002 3002
D800-DFFF D800-DFFF
E000-F8FF E000-F8FF
FFF0-FFFC FDD0-FDEF
FFF9-FFFC
FFFD FFFD
FFFE-FFFF FFFE-FFFF
1D173-1D17A
1FFFE-1FFFF 1FFFE-1FFFF
2FFFE-2FFFF 2FFFE-2FFFF
3FFFE-3FFFF 3FFFE-3FFFF
4FFFE-4FFFF 4FFFE-4FFFF
5FFFE-5FFFF 5FFFE-5FFFF
6FFFE-6FFFF 6FFFE-6FFFF
7FFFE-7FFFF 7FFFE-7FFFF
8FFFE-8FFFF 8FFFE-8FFFF
9FFFE-9FFFF 9FFFE-9FFFF
AFFFE-AFFFF AFFFE-AFFFF
BFFFE-BFFFF BFFFE-BFFFF
CFFFE-CFFFF CFFFE-CFFFF
DFFFE-DFFFF DFFFE-DFFFF
E0000-E0FFF E0001
E0020-E007F
EFFFE-EFFFF EFFFE-EFFFF
F0000-FFFFD F0000-FFFFD
FFFFE-FFFFF FFFFE-FFFFF
100000-10FFFD 100000-10FFFD
10FFFE-10FFFF 10FFFE-10FFFF
----- End Prohibited Table ----- ----- End Prohibited Table -----
NOTE WELL: Software that follows this specification that will be used to NOTE WELL: Software that follows this specification that will be used to
check names before they are put in authoritative name servers MUST add check names before they are put in authoritative name servers MUST add
all unassigned code points to the list of characters that are prohibited. all unassigned code points to the list of characters that are prohibited.
See Section 6 for more details. See Section 6 for more details.
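A table in the format of Appendix F (one code point or one inclusive range per line, in hexadecimal) could be read and queried along these lines. This is a minimal sketch under stated assumptions: parse_ranges, is_prohibited, and SAMPLE_ROWS are hypothetical names, and only a handful of rows from the new version of the table are included.

```python
def parse_ranges(lines):
    """Turn lines such as '00A0' or '2FF0-2FFB' into sorted (low, high) pairs."""
    ranges = []
    for line in lines:
        entry = line.strip()
        # Skip blank lines and the '----- Start/End ... -----' delimiters.
        if not entry or entry.startswith("-----"):
            continue
        first, _, last = entry.partition("-")
        low = int(first, 16)
        high = int(last, 16) if last else low
        ranges.append((low, high))
    return sorted(ranges)


def is_prohibited(code_point, ranges):
    """Return True if code_point falls inside any inclusive range."""
    return any(low <= code_point <= high for low, high in ranges)


# A few rows copied from the new version of the table, for illustration only.
SAMPLE_ROWS = [
    "----- Start Prohibited Table -----",
    "0000-002C",
    "002E-002F",
    "00A0",
    "2FF0-2FFB",
    "E0001",
    "E0020-E007F",
    "----- End Prohibited Table -----",
]

SAMPLE_RANGES = parse_ranges(SAMPLE_ROWS)
```

Software that checks names for authoritative name servers would, per the note above, merge the unassigned table into the same list before calling is_prohibited.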
G. Unassigned Code Point List G. Unassigned Code Point List
----- Start Unassigned Table ----- ----- Start Unassigned Table -----
skipping to change at line 2573 skipping to change at line 2574
70000-7FFFD 70000-7FFFD
80000-8FFFD 80000-8FFFD
90000-9FFFD 90000-9FFFD
A0000-AFFFD A0000-AFFFD
B0000-BFFFD B0000-BFFFD
C0000-CFFFD C0000-CFFFD
D0000-DFFFD D0000-DFFFD
E0000 E0000
E0002-E001F E0002-E001F
E0080-EFFFD E0080-EFFFD
10FFFE-10FFFF
----- End Unassigned Table ----- ----- End Unassigned Table -----
 End of changes. 62 change blocks. 
130 lines changed or deleted 130 lines changed or added
