draft-ietf-idn-nameprep-06.txt   draft-ietf-idn-nameprep-07.txt 
Internet Draft Paul Hoffman Internet Draft Paul Hoffman
draft-ietf-idn-nameprep-06.txt IMC & VPNC draft-ietf-idn-nameprep-07.txt IMC & VPNC
September 27, 2001 Marc Blanchet January 9, 2001 Marc Blanchet
Expires in six months ViaGenie Expires in six months ViaGenie
Stringprep Profile for Internationalized Host Names Stringprep Profile for Internationalized Host Names
Status of this memo Status of this memo
This document is an Internet-Draft and is in full conformance with all This document is an Internet-Draft and is in full conformance with all
provisions of Section 10 of RFC2026. provisions of Section 10 of RFC2026.
Internet-Drafts are working documents of the Internet Engineering Task Internet-Drafts are working documents of the Internet Engineering Task
skipping to change at line 166 skipping to change at line 166
The collected lists of prohibited code points can be found in Appendix E The collected lists of prohibited code points can be found in Appendix E
of this document. The lists in Appendix E MUST be used by implementations of this document. The lists in Appendix E MUST be used by implementations
of this specification. If there are any discrepancies between the lists of this specification. If there are any discrepancies between the lists
in Appendix E and subsections below, the lists in Appendix E always take in Appendix E and subsections below, the lists in Appendix E always take
precedence. precedence.
Some code points listed in one section would also appear in other Some code points listed in one section would also appear in other
sections. Each code point is only listed once in the tables in Appendix sections. Each code point is only listed once in the tables in Appendix
E. E.
5.1 Currently-prohibited ASCII characters 5.1 Space characters
Some of the ASCII characters that are currently prohibited in host names
by [STD13] are also used in protocol elements such as URIs [URI]. The other
characters in the range U+0000 to U+007F that are not currently allowed
are also prohibited in host name parts to reserve them for future use in
protocol elements.
0000-002C; [ASCII CONTROL CHARACTERS and SPACE through ,]
002E-002F; [ASCII . through /]
003A-0040; [ASCII : through @]
005B-0060; [ASCII [ through `]
007B-007F; [ASCII { through DEL]
5.2 Space characters
Space characters would make visual transcription of URLs nearly Space characters would make visual transcription of URLs nearly
impossible and could lead to user entry errors in many ways. impossible and could lead to user entry errors in many ways.
0020; SPACE 0020; SPACE
00A0; NO-BREAK SPACE 00A0; NO-BREAK SPACE
1680; OGHAM SPACE MARK 1680; OGHAM SPACE MARK
2000; EN QUAD 2000; EN QUAD
2001; EM QUAD 2001; EM QUAD
2002; EN SPACE 2002; EN SPACE
skipping to change at line 202 skipping to change at line 188
2004; THREE-PER-EM SPACE 2004; THREE-PER-EM SPACE
2005; FOUR-PER-EM SPACE 2005; FOUR-PER-EM SPACE
2006; SIX-PER-EM SPACE 2006; SIX-PER-EM SPACE
2007; FIGURE SPACE 2007; FIGURE SPACE
2008; PUNCTUATION SPACE 2008; PUNCTUATION SPACE
2009; THIN SPACE 2009; THIN SPACE
200A; HAIR SPACE 200A; HAIR SPACE
202F; NARROW NO-BREAK SPACE 202F; NARROW NO-BREAK SPACE
3000; IDEOGRAPHIC SPACE 3000; IDEOGRAPHIC SPACE
5.3 Control characters 5.2 Control characters
Control characters (or characters with control function) cannot be seen Control characters (or characters with control function) cannot be seen
and can cause unpredictable results when displayed. and can cause unpredictable results when displayed.
0000-001F; [CONTROL CHARACTERS] 0000-001F; [CONTROL CHARACTERS]
007F; DELETE 007F; DELETE
0080-009F; [CONTROL CHARACTERS] 0080-009F; [CONTROL CHARACTERS]
070F; SYRIAC ABBREVIATION MARK 070F; SYRIAC ABBREVIATION MARK
180E; MONGOLIAN VOWEL SEPARATOR 180E; MONGOLIAN VOWEL SEPARATOR
2028; LINE SEPARATOR 2028; LINE SEPARATOR
2029; PARAGRAPH SEPARATOR 2029; PARAGRAPH SEPARATOR
206A-206F; [CONTROL CHARACTERS] 206A-206F; [CONTROL CHARACTERS]
FFF9-FFFC; [CONTROL CHARACTERS] FFF9-FFFC; [CONTROL CHARACTERS]
1D173-1D17A; [MUSICAL CONTROL CHARACTERS] 1D173-1D17A; [MUSICAL CONTROL CHARACTERS]
5.4 Private use and replacement characters 5.3 Private use and replacement characters
Because private-use characters do not have defined meanings, they are Because private-use characters do not have defined meanings, they are
prohibited. The private-use characters are: prohibited. The private-use characters are:
E000-F8FF; [PRIVATE USE, PLANE 0] E000-F8FF; [PRIVATE USE, PLANE 0]
F0000-FFFFD; [PRIVATE USE, PLANE 15] F0000-FFFFD; [PRIVATE USE, PLANE 15]
100000-10FFFD; [PRIVATE USE, PLANE 16] 100000-10FFFD; [PRIVATE USE, PLANE 16]
The replacement character (U+FFFD) has no known semantic definition in a The replacement character (U+FFFD) has no known semantic definition in a
name, and is often displayed by renderers to indicate "there would be name, and is often displayed by renderers to indicate "there would be
some character here, but it cannot be rendered". For example, on a some character here, but it cannot be rendered". For example, on a
computer with no Asian fonts, a name with three ideographs might be computer with no Asian fonts, a name with three ideographs might be
rendered with three replacement characters. rendered with three replacement characters.
FFFD; REPLACEMENT CHARACTER FFFD; REPLACEMENT CHARACTER
5.5 Non-character code points 5.4 Non-character code points
Non-character code points are code points that have been allocated in Non-character code points are code points that have been allocated in
ISO/IEC 10646 but are not characters. Because they are already assigned, ISO/IEC 10646 but are not characters. Because they are already assigned,
they are guaranteed not to later change into characters. they are guaranteed not to later change into characters.
FDD0-FDEF; [NONCHARACTER CODE POINTS] FDD0-FDEF; [NONCHARACTER CODE POINTS]
FFFE-FFFF; [NONCHARACTER CODE POINTS] FFFE-FFFF; [NONCHARACTER CODE POINTS]
1FFFE-1FFFF; [NONCHARACTER CODE POINTS] 1FFFE-1FFFF; [NONCHARACTER CODE POINTS]
2FFFE-2FFFF; [NONCHARACTER CODE POINTS] 2FFFE-2FFFF; [NONCHARACTER CODE POINTS]
3FFFE-3FFFF; [NONCHARACTER CODE POINTS] 3FFFE-3FFFF; [NONCHARACTER CODE POINTS]
skipping to change at line 263 skipping to change at line 249
BFFFE-BFFFF; [NONCHARACTER CODE POINTS] BFFFE-BFFFF; [NONCHARACTER CODE POINTS]
CFFFE-CFFFF; [NONCHARACTER CODE POINTS] CFFFE-CFFFF; [NONCHARACTER CODE POINTS]
DFFFE-DFFFF; [NONCHARACTER CODE POINTS] DFFFE-DFFFF; [NONCHARACTER CODE POINTS]
EFFFE-EFFFF; [NONCHARACTER CODE POINTS] EFFFE-EFFFF; [NONCHARACTER CODE POINTS]
FFFFE-FFFFF; [NONCHARACTER CODE POINTS] FFFFE-FFFFF; [NONCHARACTER CODE POINTS]
10FFFE-10FFFF; [NONCHARACTER CODE POINTS] 10FFFE-10FFFF; [NONCHARACTER CODE POINTS]
The non-character code points are listed in the PropList.txt file from the The non-character code points are listed in the PropList.txt file from the
Unicode database. Unicode database.
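The ranges in this section follow a regular pattern: the block FDD0-FDEF, plus the last two code points of each plane. A minimal Python sketch of that check follows. The function name is_noncharacter is hypothetical and not part of either draft, and the sketch assumes the entries elided by the diff continue the same per-plane pattern.

```python
def is_noncharacter(cp: int) -> bool:
    """Return True if cp is a non-character code point (hypothetical helper)."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("code point out of range")
    # FDD0-FDEF is a contiguous block in plane 0.
    if 0xFDD0 <= cp <= 0xFDEF:
        return True
    # xFFFE and xFFFF at the end of every plane (0 through 16):
    # masking off the plane bits and the lowest bit leaves FFFE.
    return (cp & 0xFFFE) == 0xFFFE
```

Note that U+FFFD (REPLACEMENT CHARACTER) is not matched by this check; it is prohibited separately, as described earlier in section 5.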
5.6 Surrogate codes 5.5 Surrogate codes
The following code points are permanently reserved for use as surrogate The following code points are permanently reserved for use as surrogate
code values in the UTF-16 encoding, will never be assigned to code values in the UTF-16 encoding, will never be assigned to
characters, and are therefore prohibited: characters, and are therefore prohibited:
D800-DFFF; [SURROGATE CODES] D800-DFFF; [SURROGATE CODES]
5.7 Inappropriate for plain text 5.6 Inappropriate for plain text
The following characters should not appear in regular text. The following characters should not appear in regular text.
FFF9; INTERLINEAR ANNOTATION ANCHOR FFF9; INTERLINEAR ANNOTATION ANCHOR
FFFA; INTERLINEAR ANNOTATION SEPARATOR FFFA; INTERLINEAR ANNOTATION SEPARATOR
FFFB; INTERLINEAR ANNOTATION TERMINATOR FFFB; INTERLINEAR ANNOTATION TERMINATOR
FFFC; OBJECT REPLACEMENT CHARACTER FFFC; OBJECT REPLACEMENT CHARACTER
5.8 Inappropriate for canonical representation 5.7 Inappropriate for canonical representation
The ideographic description characters allow different sequences of The ideographic description characters allow different sequences of
characters to be rendered the same way, which makes them inappropriate characters to be rendered the same way, which makes them inappropriate
for host names that must have a single canonical representation. for host names that must have a single canonical representation.
2FF0-2FFB; [IDEOGRAPHIC DESCRIPTION CHARACTERS] 2FF0-2FFB; [IDEOGRAPHIC DESCRIPTION CHARACTERS]
5.9 Change display properties 5.8 Change display properties
The following characters, some of which are deprecated in ISO/IEC 10646, The following characters, some of which are deprecated in ISO/IEC 10646,
can cause changes in display or the order in which characters appear can cause changes in display or the order in which characters appear
when rendered. when rendered.
200E; LEFT-TO-RIGHT MARK 200E; LEFT-TO-RIGHT MARK
200F; RIGHT-TO-LEFT MARK 200F; RIGHT-TO-LEFT MARK
202A; LEFT-TO-RIGHT EMBEDDING 202A; LEFT-TO-RIGHT EMBEDDING
202B; RIGHT-TO-LEFT EMBEDDING 202B; RIGHT-TO-LEFT EMBEDDING
202C; POP DIRECTIONAL FORMATTING 202C; POP DIRECTIONAL FORMATTING
202D; LEFT-TO-RIGHT OVERRIDE 202D; LEFT-TO-RIGHT OVERRIDE
202E; RIGHT-TO-LEFT OVERRIDE 202E; RIGHT-TO-LEFT OVERRIDE
206A; INHIBIT SYMMETRIC SWAPPING 206A; INHIBIT SYMMETRIC SWAPPING
206B; ACTIVATE SYMMETRIC SWAPPING 206B; ACTIVATE SYMMETRIC SWAPPING
206C; INHIBIT ARABIC FORM SHAPING 206C; INHIBIT ARABIC FORM SHAPING
206D; ACTIVATE ARABIC FORM SHAPING 206D; ACTIVATE ARABIC FORM SHAPING
206E; NATIONAL DIGIT SHAPES 206E; NATIONAL DIGIT SHAPES
206F; NOMINAL DIGIT SHAPES 206F; NOMINAL DIGIT SHAPES
5.10 Inappropriate characters from common input mechanisms 5.9 Inappropriate characters from common input mechanisms
U+3002 is used as if it were U+002E in many input mechanisms, U+3002 is used as if it were U+002E in many input mechanisms,
particularly in Asia. This prohibition allows input mechanisms to safely particularly in Asia. This prohibition allows input mechanisms to safely
map U+3002 to U+002E before doing stringprep without worrying about map U+3002 to U+002E before doing stringprep without worrying about
preventing users from accessing legitimate host name parts. preventing users from accessing legitimate host name parts.
3002; IDEOGRAPHIC FULL STOP 3002; IDEOGRAPHIC FULL STOP
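The pre-mapping that this prohibition permits can be sketched in Python as follows. The function name premap_full_stop is hypothetical; the draft does not define such a step, it only notes that an input mechanism may map U+3002 to U+002E before doing stringprep.

```python
IDEOGRAPHIC_FULL_STOP = "\u3002"
FULL_STOP = "\u002E"

def premap_full_stop(name: str) -> str:
    """Replace every U+3002 with U+002E before the name is passed to stringprep."""
    return name.replace(IDEOGRAPHIC_FULL_STOP, FULL_STOP)
```

Because U+3002 is prohibited in host name parts, this substitution cannot change a name that would otherwise have been legitimate.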
5.11 Tagging characters 5.10 Tagging characters
The following characters are used for tagging text and are invisible. The following characters are used for tagging text and are invisible.
E0001; LANGUAGE TAG E0001; LANGUAGE TAG
E0020-E007F; [TAGGING CHARACTERS] E0020-E007F; [TAGGING CHARACTERS]
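The entries in the lists above share one shape: a hexadecimal code point or range, optionally followed by a semicolon and a description. A minimal Python sketch of parsing such entries and testing a host name part against them follows. The names EXCERPT, parse_entries, and first_prohibited are hypothetical, and EXCERPT holds only a few entries copied from this section; an implementation is required to use the full lists in Appendix E.

```python
from typing import List, Optional, Tuple

# Excerpt only: a few entries copied from section 5, not the full table.
EXCERPT = """\
0020; SPACE
0000-001F; [CONTROL CHARACTERS]
E000-F8FF; [PRIVATE USE, PLANE 0]
D800-DFFF; [SURROGATE CODES]
3002; IDEOGRAPHIC FULL STOP
E0020-E007F; [TAGGING CHARACTERS]
"""

def parse_entries(text: str) -> List[Tuple[int, int]]:
    """Parse 'XXXX' or 'XXXX-YYYY' entries (description optional) into ranges."""
    ranges = []
    for line in text.splitlines():
        # Keep only the code point field that precedes any ';'.
        field = line.split(";", 1)[0].strip()
        if not field:
            continue
        first, _, last = field.partition("-")
        lo = int(first, 16)
        hi = int(last, 16) if last else lo
        ranges.append((lo, hi))
    return ranges

def first_prohibited(label: str, ranges: List[Tuple[int, int]]) -> Optional[int]:
    """Return the first prohibited code point in label, or None if there is none."""
    for ch in label:
        cp = ord(ch)
        if any(lo <= cp <= hi for lo, hi in ranges):
            return cp
    return None
```

The same parser accepts the bare form used in the Appendix E table (for example, a line containing only 0080-009F), since the description field is optional. A linear scan over the ranges is used here for clarity; a sorted list with binary search would be a reasonable alternative for the full table.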
6. Unassigned Code Points in Internationalized Host Names 6. Unassigned Code Points in Internationalized Host Names
This profile lists the unassigned code points for Unicode 3.1 in This profile lists the unassigned code points for Unicode 3.1 in
Appendix F. The list in Appendix F MUST be used by implementations of Appendix F. The list in Appendix F MUST be used by implementations of
skipping to change at line 395 skipping to change at line 381
Literal Addresses in URL's", December 1999, RFC 2732. Note that Literal Addresses in URL's", December 1999, RFC 2732. Note that
there are many other RFCs that define additional URI schemes. there are many other RFCs that define additional URI schemes.
[UAX15] Mark Davis and Martin Duerst. Unicode Standard Annex #15: [UAX15] Mark Davis and Martin Duerst. Unicode Standard Annex #15:
Unicode Normalization Forms, Version 3.1.0. Unicode Normalization Forms, Version 3.1.0.
<http://www.unicode.org/unicode/reports/tr15/tr15-21.html> <http://www.unicode.org/unicode/reports/tr15/tr15-21.html>
[UTR21] Mark Davis. Case Mappings. Unicode Technical Report #21. [UTR21] Mark Davis. Case Mappings. Unicode Technical Report #21.
<http://www.unicode.org/unicode/reports/tr21/>. <http://www.unicode.org/unicode/reports/tr21/>.
9. Differences Between -05 and -06 Drafts 9. Differences Between -06 and -07 Drafts
Throughout: became a profile of stringprep.
A: Added Dave Crocker 5: Removed 5.1 (currently-prohibited ASCII characters) and renumbered
the entire section.
B: Made this section 9 to ease later renumbering. Relettered other E: Removed the characters that appeared in the old 5.1.
appendices.
A. Acknowledgements A. Acknowledgements
Many people from the IETF IDN Working Group and the Unicode Technical Many people from the IETF IDN Working Group and the Unicode Technical
Committee contributed ideas that went into the first draft of this Committee contributed ideas that went into the first draft of this
document. document.
The IDN nameprep design team made many useful changes to the first The IDN nameprep design team made many useful changes to the first
draft. That team and its advisors include: draft. That team and its advisors include:
skipping to change at line 1830 skipping to change at line 1814
1D7A5; 03C6; Additional folding 1D7A5; 03C6; Additional folding
1D7A6; 03C7; Additional folding 1D7A6; 03C7; Additional folding
1D7A7; 03C8; Additional folding 1D7A7; 03C8; Additional folding
1D7A8; 03C9; Additional folding 1D7A8; 03C9; Additional folding
1D7BB; 03C3; Additional folding 1D7BB; 03C3; Additional folding
----- End Mapping Table ----- ----- End Mapping Table -----
E. Prohibited Code Point List E. Prohibited Code Point List
----- Start Prohibited Table ----- ----- Start Prohibited Table -----
0000-002C 0000-0020
002E-002F 007F
003A-0040
005B-0060
007B-007F
0080-009F 0080-009F
00A0 00A0
070F 070F
1680 1680
180E 180E
2000 2000
2001 2001
2002 2002
2003 2003
2004 2004
 End of changes. 15 change blocks. 
37 lines changed or deleted 18 lines changed or added
