draft-ietf-ldapbis-strprep-01.txt   draft-ietf-ldapbis-strprep-02.txt 
Internet-Draft Kurt D. Zeilenga Internet-Draft Kurt D. Zeilenga
Intended Category: Standard Track OpenLDAP Foundation Intended Category: Standard Track OpenLDAP Foundation
Expires in six months 30 June 2003 Expires in six months 27 October 2003
LDAP: Internationalized String Preparation LDAP: Internationalized String Preparation
<draft-ietf-ldapbis-strprep-01.txt> <draft-ietf-ldapbis-strprep-02.txt>
Status of this Memo Status of this Memo
This document is an Internet-Draft and is in full conformance with all This document is an Internet-Draft and is in full conformance with all
provisions of Section 10 of RFC2026. provisions of Section 10 of RFC2026.
Distribution of this memo is unlimited. Technical discussion of this Distribution of this memo is unlimited. Technical discussion of this
document will take place on the IETF LDAP Revision Working Group document will take place on the IETF LDAP Revision Working Group
mailing list <ietf-ldapbis@openldap.org>. Please send editorial mailing list <ietf-ldapbis@openldap.org>. Please send editorial
comments directly to the author <Kurt@OpenLDAP.org>. comments directly to the author <Kurt@OpenLDAP.org>.
skipping to change at page 3, line 28 skipping to change at page 3, line 28
The lack of precise specification for character string matching has The lack of precise specification for character string matching has
led to significant interoperability problems. When used in led to significant interoperability problems. When used in
certificate chain validation, security vulnerabilities can arise. To certificate chain validation, security vulnerabilities can arise. To
address these problems, this document defines precise algorithms for address these problems, this document defines precise algorithms for
preparing character strings for matching. preparing character strings for matching.
1.3. Relationship to "stringprep" 1.3. Relationship to "stringprep"
The character string preparation algorithms described in this document The character string preparation algorithms described in this document
are based upon the "stringprep" approach [RFC3454]. In "stringprep", are based upon the "stringprep" approach [StringPrep]. In
presented and stored values are first prepared for comparison and so "stringprep", presented and stored values are first prepared for
that a character-by-character comparison yields the "correct" result. comparison and so that a character-by-character comparison yields the
"correct" result.
The approach used here is a refinement of the "stringprep" [RFC3454] The approach used here is a refinement of the "stringprep"
approach. Each algorithm involves two additional preparation steps. [StringPrep] approach. Each algorithm involves two additional
preparation steps.
a) prior to applying the Unicode string preparation steps outlined in a) prior to applying the Unicode string preparation steps outlined in
"stringprep", the string is transcoded to Unicode; "stringprep", the string is transcoded to Unicode;
b) after applying the Unicode string preparation steps outlined in b) after applying the Unicode string preparation steps outlined in
"stringprep", characters insignificant to the matching rules are "stringprep", characters insignificant to the matching rules are
removed. removed.
Hence, preparation of character strings for X.500 matching involves Hence, preparation of character strings for X.500 matching involves
the following steps: the following steps:
skipping to change at page 5, line 15 skipping to change at page 5, line 18
TeletexString [X.680][T.61] values are transcoded to Unicode as TeletexString [X.680][T.61] values are transcoded to Unicode as
described in Appendix A. described in Appendix A.
PrintableString [X.680] values are transcoded directly to Unicode. PrintableString [X.680] values are transcoded directly to Unicode.
UniversalString, UTF8String, and bmpString [X.680] values need not be UniversalString, UTF8String, and bmpString [X.680] values need not be
transcoded as they are Unicode-based strings (in the case of transcoded as they are Unicode-based strings (in the case of
bmpString, a subset of Unicode). bmpString, a subset of Unicode).
If the implementation is unable or unwilling to perform the The output is the transcoded string.
transcoding as described above, or the transcoding fails, this step
fails and the assertion is evaluated to Undefined.
The transcoded string is the output string.
2.2. Map 2.2. Map
SOFT HYPHEN (U+00AD) and MONGOLIAN TODO SOFT HYPHEN (U+1806) code SOFT HYPHEN (U+00AD) and MONGOLIAN TODO SOFT HYPHEN (U+1806) code
points are mapped to nothing. COMBINING GRAPHEME JOINER (U+034F) and points are mapped to nothing. COMBINING GRAPHEME JOINER (U+034F) and
VARIATION SELECTORs (U+180B-180D,FE00-FE0F) code points are also VARIATION SELECTORs (U+180B-180D,FE00-FE0F) code points are also
mapped to nothing. The OBJECT REPLACEMENT CHARACTER (U+FFFC) is mapped to nothing. The OBJECT REPLACEMENT CHARACTER (U+FFFC) is
mapped to nothing. mapped to nothing.
CHARACTER TABULATION (U+0009), LINE FEED (LF) (U+000A), LINE CHARACTER TABULATION (U+0009), LINE FEED (LF) (U+000A), LINE
skipping to change at page 5, line 41 skipping to change at page 5, line 40
(U+000D), and NEXT LINE (NEL) (U+0085) are mapped to SPACE (U+0020). (U+000D), and NEXT LINE (NEL) (U+0085) are mapped to SPACE (U+0020).
All other control code points (e.g., Cc) or code points with a control All other control code points (e.g., Cc) or code points with a control
function (e.g., Cf) are mapped to nothing. function (e.g., Cf) are mapped to nothing.
ZERO WIDTH SPACE (U+200B) is mapped to nothing. All other code points ZERO WIDTH SPACE (U+200B) is mapped to nothing. All other code points
with Separator (space, line, or paragraph) property (e.g., Zs, Zl, or with Separator (space, line, or paragraph) property (e.g., Zs, Zl, or
Zp) are mapped to SPACE (U+0020). Zp) are mapped to SPACE (U+0020).
For case ignore, numeric, and stored prefix string matching rules, For case ignore, numeric, and stored prefix string matching rules,
characters are case folded per B.2 of [RFC3454]. characters are case folded per B.2 of [StringPrep].
The output is the mapped string.
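The Map step above can be sketched in Python as follows. This is an illustration under stated assumptions, not a conforming implementation: the function name map_step and its fold_case parameter are hypothetical, the variation selector range is taken to be U+FE00-FE0F, str.casefold() only approximates the folding table in B.2 of [StringPrep], and the standard unicodedata module reports categories from a newer Unicode version than the one the draft targets. The list of controls mapped to SPACE is partly elided in this diff, so only the four code points visible above are included.

```python
import unicodedata

# Code points mapped to nothing, as listed in Section 2.2.
MAP_TO_NOTHING = {0x00AD, 0x1806, 0x034F, 0xFFFC, 0x200B}
MAP_TO_NOTHING.update(range(0x180B, 0x180E))  # U+180B-180D
MAP_TO_NOTHING.update(range(0xFE00, 0xFE10))  # U+FE00-FE0F (assumed range)

# Controls mapped to SPACE. Only the code points visible in this diff are
# listed; the elided part of the sentence may name more.
MAP_TO_SPACE = {0x0009, 0x000A, 0x000D, 0x0085}


def map_step(s, fold_case=False):
    """Hypothetical helper: apply the Map step to an already transcoded string."""
    out = []
    for ch in s:
        cp = ord(ch)
        cat = unicodedata.category(ch)
        if cp in MAP_TO_NOTHING:
            continue                      # mapped to nothing
        if cp in MAP_TO_SPACE:
            out.append(" ")               # whitespace-like control -> SPACE
        elif cat in ("Cc", "Cf"):
            continue                      # other controls -> nothing
        elif cat in ("Zs", "Zl", "Zp"):
            out.append(" ")               # other separators -> SPACE
        else:
            out.append(ch)
    mapped = "".join(out)
    # Approximation of B.2 case folding, used only for the rules that need it.
    return mapped.casefold() if fold_case else mapped
```

For example, map_step("a\u00adb") drops the SOFT HYPHEN and map_step("a\u00a0b") turns the NO-BREAK SPACE into an ordinary SPACE.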
2.3. Normalize 2.3. Normalize
The input string is to be normalized to Unicode Form KC (compatibility The input string is to be normalized to Unicode Form KC (compatibility
composed) as described in [UAX15]. composed) as described in [UAX15]. The output is the normalized
string.
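As a brief illustration of the Normalize step, Python's standard unicodedata module exposes Form KC under the name "NFKC". The wrapper name normalize_step is hypothetical, and the module follows the Unicode version shipped with the interpreter rather than the version the draft references.

```python
import unicodedata


def normalize_step(s):
    """Hypothetical helper: normalize a mapped string to Unicode Form KC."""
    return unicodedata.normalize("NFKC", s)
```

With this helper, the compatibility ligature U+FB01 becomes the two letters "fi", and "e" followed by COMBINING ACUTE ACCENT (U+0301) composes to U+00E9.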
2.4. Prohibit 2.4. Prohibit
All Unassigned, Private Use, and non-character code points are
prohibited. Surrogate codes (U+D800-DFFF) are prohibited. All Unassigned code points are prohibited. Unassigned code points are
listed in Table A.1 of [StringPrep].
Private Use (U+E000-F8FF, F0000-FFFFD, 100000-10FFFD) code points are
prohibited.
All non-character code points (U+FDD0-FDEF, FFFE-FFFF, 1FFFE-1FFFF,
2FFFE-2FFFF, 3FFFE-3FFFF, 4FFFE-4FFFF, 5FFFE-5FFFF, 6FFFE-6FFFF,
7FFFE-7FFFF, 8FFFE-8FFFF, 9FFFE-9FFFF, AFFFE-AFFFF, BFFFE-BFFFF,
CFFFE-CFFFF, DFFFE-DFFFF, EFFFE-EFFFF, FFFFE-FFFFF, 10FFFE-10FFFF) are
prohibited.
Surrogate codes (U+D800-DFFF) are prohibited.
The REPLACEMENT CHARACTER (U+FFFD) code point is prohibited. The REPLACEMENT CHARACTER (U+FFFD) code point is prohibited.
The first code point of a string is prohibited from being a combining The first code point of a string is prohibited from being a combining
character. character.
The step fails and the assertion is evaluated to Undefined if the The step fails if the input string contains any prohibited code point.
input string contains any prohibited code point. The output string is The output is the input string.
the input string.
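The Prohibit checks can be sketched as below. The names PrepFailure, is_prohibited and prohibit_step are hypothetical. Two points are assumptions rather than statements of the draft: unassigned code points are detected through the Unicode 3.2 database that Python ships as unicodedata.ucd_3_2_0 (category "Cn") instead of through Table A.1 of [StringPrep], and "combining character" is read as any code point in general category M. The surrogate range is taken to be U+D800-DFFF.

```python
import unicodedata


class PrepFailure(Exception):
    """Hypothetical signal that a preparation step failed."""


def is_prohibited(cp):
    if 0xD800 <= cp <= 0xDFFF:                       # surrogate codes
        return True
    if (0xE000 <= cp <= 0xF8FF or 0xF0000 <= cp <= 0xFFFFD
            or 0x100000 <= cp <= 0x10FFFD):          # Private Use
        return True
    if 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE:
        return True                                  # non-characters
    if cp == 0xFFFD:                                 # REPLACEMENT CHARACTER
        return True
    # Assumption: approximate "Unassigned" with the Unicode 3.2 database.
    return unicodedata.ucd_3_2_0.category(chr(cp)) == "Cn"


def prohibit_step(s):
    """Return s unchanged, or raise PrepFailure if the step fails."""
    # Assumption: "combining character" means general category Mn, Mc or Me.
    if s and unicodedata.category(s[0]).startswith("M"):
        raise PrepFailure("string starts with a combining character")
    for ch in s:
        if is_prohibited(ord(ch)):
            raise PrepFailure("prohibited code point U+%04X" % ord(ch))
    return s
```

A caller that evaluates a matching rule would catch PrepFailure and treat the assertion as Undefined, in line with the older wording of this section.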
2.5. Check bidi 2.5. Check bidi
There are no bidirectional restrictions. The output string is the There are no bidirectional restrictions. The output is the input
input string. string.
2.5. Insignificant Character Removal 2.5. Insignificant Character Removal
In this step, characters insignificant to the matching rule are to be In this step, characters insignificant to the matching rule are to be
removed. The characters to be removed differ from matching rule to removed. The characters to be removed differ from matching rule to
matching rule. matching rule.
Section 2.5.1 applies to case ignore and exact string matching. Section 2.5.1 applies to case ignore and exact string matching.
Section 2.5.2 applies to numericString matching. Section 2.5.2 applies to numericString matching.
Section 2.5.3 applies to telephoneNumber matching Section 2.5.3 applies to telephoneNumber matching
skipping to change at page 8, line 11 skipping to change at page 8, line 24
"<SPACE><HYPHEN>123<SPACE><SPACE>456<SPACE><HYPHEN>" "<SPACE><HYPHEN>123<SPACE><SPACE>456<SPACE><HYPHEN>"
would result in the output string: would result in the output string:
"123456" "123456"
and the Form KC string: and the Form KC string:
"<HYPHEN><HYPHEN><HYPHEN>" "<HYPHEN><HYPHEN><HYPHEN>"
would result in the output string: would result in the output string:
"<SPACE>". "<SPACE>".
3. Security Considerations 3. Security Considerations
"Preparation for International Strings ('stringprep')" [RFC3454] "Preparation for International Strings ('stringprep')" [StringPrep]
security considerations generally apply to the algorithms described security considerations generally apply to the algorithms described
here. here.
4. Contributors 4. Contributors
Appendices A and B of this document were authored by Howard Chu Appendices A and B of this document were authored by Howard Chu
<hyc@symas.com> of Symas Corporation (based upon information provided <hyc@symas.com> of Symas Corporation (based upon information provided
in RFC 1345). in RFC 1345).
5. Acknowledgments 5. Acknowledgments
The approach used in this document is based upon design principles and The approach used in this document is based upon design principles and
algorithms described in "Preparation of Internationalized Strings algorithms described in "Preparation of Internationalized Strings
('stringprep')" [RFC3454] by Paul Hoffman and Marc Blanchet. Some ('stringprep')" [StringPrep] by Paul Hoffman and Marc Blanchet. Some
additional guidance was drawn from Unicode Technical Standards, additional guidance was drawn from Unicode Technical Standards,
Technical Reports, and Notes. Technical Reports, and Notes.
This document is a product of the IETF LDAP Revision (LDAPBIS) Working This document is a product of the IETF LDAP Revision (LDAPBIS) Working
Group. Group.
6. Author's Address 6. Author's Address
Kurt Zeilenga Kurt Zeilenga
E-mail: <kurt@openldap.org> E-mail: <kurt@openldap.org>
7. References 7. References
7.1. Normative References 7.1. Normative References
[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
Requirement Levels", BCP 14 (also RFC 2119), March 1997. Requirement Levels", BCP 14 (also RFC 2119), March 1997.
[RFC3454] Hoffman, P. and M. Blanchet, "Preparation of
Internationalized Strings ('stringprep')", RFC 3454,
December 2002.
[Roadmap] Zeilenga, K. (editor), "LDAP: Technical Specification [Roadmap] Zeilenga, K. (editor), "LDAP: Technical Specification
Road Map", draft-ietf-ldapbis-roadmap-xx.txt, a work in Road Map", draft-ietf-ldapbis-roadmap-xx.txt, a work in
progress. progress.
[StringPrep] Hoffman P. and M. Blanchet, "Preparation of
Internationalized Strings ('stringprep')",
draft-hoffman-rfc3454bis-xx.txt, a work in progress.
[Syntaxes] Legg, S. (editor), "LDAP: Syntaxes and Matching Rules", [Syntaxes] Legg, S. (editor), "LDAP: Syntaxes and Matching Rules",
draft-ietf-ldapbis-syntaxes-xx.txt, a work in progress. draft-ietf-ldapbis-syntaxes-xx.txt, a work in progress.
[ISO10646] International Organization for Standardization, [ISO10646] International Organization for Standardization,
"Universal Multiple-Octet Coded Character Set (UCS) - "Universal Multiple-Octet Coded Character Set (UCS) -
Architecture and Basic Multilingual Plane", ISO/IEC Architecture and Basic Multilingual Plane", ISO/IEC
10646-1 : 1993. 10646-1 : 1993.
[Unicode] The Unicode Consortium, "The Unicode Standard, Version [Unicode] The Unicode Consortium, "The Unicode Standard, Version
3.2.0" is defined by "The Unicode Standard, Version 3.0" 3.2.0" is defined by "The Unicode Standard, Version 3.0"
skipping to change at page 10, line 25 skipping to change at page 10, line 40
for X.500", draft-zeilenga-ldapbis-strmatch-xx.txt, a for X.500", draft-zeilenga-ldapbis-strmatch-xx.txt, a
work in progress. work in progress.
[RFC1345] Simonsen, K., "Character Mnemonics & Character Sets", [RFC1345] Simonsen, K., "Character Mnemonics & Character Sets",
RFC 1345, June 1992. RFC 1345, June 1992.
Appendix A. Teletex (T.61) to Unicode Appendix A. Teletex (T.61) to Unicode
This appendix defines an algorithm for transcoding [T.61] characters This appendix defines an algorithm for transcoding [T.61] characters
to [Unicode] characters for use in string preparation for LDAP to [Unicode] characters for use in string preparation for LDAP
matching rules. This appendix is a normative. matching rules. This appendix is normative.
The transcoding algorithm is derived from the T.61-8bit definition The transcoding algorithm is derived from the T.61-8bit definition
provided in [RFC1345]. With a few exceptions, the T.61 character provided in [RFC1345]. With a few exceptions, the T.61 character
codes from x00 to x7f are equivalent to the corresponding [Unicode] codes from x00 to x7f are equivalent to the corresponding [Unicode]
code points, and their values are left unchanged by this algorithm. code points, and their values are left unchanged by this algorithm.
E.g. the T.61 code x20 is identical to (U+0020). The exceptions are E.g. the T.61 code x20 is identical to (U+0020). The exceptions are
for these T.61 codes that are undefined: x23, x24, x5c, x5e, x60, x7b, for these T.61 codes that are undefined: x23, x24, x5c, x5e, x60, x7b,
x7d, and x7e. x7d, and x7e.
The codes from x80 to x9f are also equivalent to the corresponding The codes from x80 to x9f are also equivalent to the corresponding
skipping to change at page 10, line 47 skipping to change at page 11, line 14
these codes are control characters, and will be mapped to nothing in these codes are control characters, and will be mapped to nothing in
the LDAP String Preparation Mapping step. the LDAP String Preparation Mapping step.
The remaining T.61 codes are mapped below in Table A.1. Table The remaining T.61 codes are mapped below in Table A.1. Table
positions marked "??" are undefined. positions marked "??" are undefined.
Input strings containing undefined T.61 codes SHALL produce an Input strings containing undefined T.61 codes SHALL produce an
Undefined matching result. For diagnostic purposes, this algorithm Undefined matching result. For diagnostic purposes, this algorithm
does not fail for undefined input codes. Instead, undefined codes in does not fail for undefined input codes. Instead, undefined codes in
the input are mapped to the Unicode REPLACEMENT CHARACTER (U+FFFD). the input are mapped to the Unicode REPLACEMENT CHARACTER (U+FFFD).
As the LDAP String Preparation Probhibit step disallows the As the LDAP String Preparation Prohibit step disallows the REPLACEMENT
REPLACEMENT CHARACTER from appearing in its output, this transcoding CHARACTER from appearing in its output, this transcoding yields the
yields the desired effect. desired effect.
Note: RFC 1345 listed the non-spacing accent codepoints as residing in Note: RFC 1345 listed the non-spacing accent codepoints as residing in
the range starting at (U+E000). In the current Unicode the range starting at (U+E000). In the current Unicode
standard, the (U+E000) range is reserved for Private Use, and standard, the (U+E000) range is reserved for Private Use, and
the non-spacing accents are in the range starting at (U+0300). the non-spacing accents are in the range starting at (U+0300).
The tables here use the (U+0300) range for these accents. The tables here use the (U+0300) range for these accents.
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
--+------+------+------+------+------+------+------+------+ --+------+------+------+------+------+------+------+------+
a0| 00a0 | 00a1 | 00a2 | 00a3 | 0024 | 00a5 | 0023 | 00a7 | a0| 00a0 | 00a1 | 00a2 | 00a3 | 0024 | 00a5 | 0023 | 00a7 |
skipping to change at page 11, line 38 skipping to change at page 12, line 4
T.61 also defines a number of accented characters that are formed by T.61 also defines a number of accented characters that are formed by
combining an accent prefix followed by a base character. These combining an accent prefix followed by a base character. These
prefixes are in the code range xc1 to xcf. If a prefix character prefixes are in the code range xc1 to xcf. If a prefix character
appears at the end of a string, the result is undefined. Otherwise appears at the end of a string, the result is undefined. Otherwise
these sequences are mapped to Unicode by substituting the these sequences are mapped to Unicode by substituting the
corresponding non-spacing accent code (as listed in Table A.1) for the corresponding non-spacing accent code (as listed in Table A.1) for the
accent prefix, and exchanging the order so that the base character accent prefix, and exchanging the order so that the base character
precedes the accent. precedes the accent.
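The accent-prefix rule above can be sketched as follows. This is a simplified illustration, not the full Appendix A algorithm. The name transcode_accents is hypothetical. The two entries in ACCENT_PREFIXES are an illustrative subset inferred from the spacing accents in Table B.1; Table A.1 remains the authoritative mapping. Mapping a trailing or unlisted prefix to U+FFFD is an assumption modeled on the treatment of undefined codes in Appendix A, and all other bytes are passed through as if identical to Unicode, which holds only for the defined codes in the x00 to x7f range.

```python
REPLACEMENT = "\ufffd"

# Illustrative subset only (assumed): xc1 grave, xc2 acute.
ACCENT_PREFIXES = {0xC1: "\u0300", 0xC2: "\u0301"}


def transcode_accents(data):
    """Hypothetical helper: reorder T.61 accent prefixes in a bytes value."""
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if 0xC1 <= b <= 0xCF:
            if i + 1 >= len(data):
                # Prefix at end of string: undefined, flagged with U+FFFD.
                out.append(REPLACEMENT)
                i += 1
                continue
            accent = ACCENT_PREFIXES.get(b, REPLACEMENT)
            base = chr(data[i + 1])       # simplification: base taken as-is
            out.append(base + accent)     # base character precedes the accent
            i += 2
        else:
            out.append(chr(b))            # simplification: pass through
            i += 1
    return "".join(out)
```

For instance, the T.61 sequence xc2 x65 becomes "e" followed by U+0301, which the Normalize step would then compose to U+00E9.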
Appendix B. Additional Teletex (T.61) to Unicode Tables Appendix B. Additional Teletex (T.61) to Unicode Tables
All of the accented characters in T.61 have a corresponding code point All of the accented characters in T.61 have a corresponding code point
in Unicode. For the sake of completeness, the combined character in Unicode. For the sake of completeness, the combined character
codes are presented in the following tables. This is informational codes are presented in the following tables. This is informational
only; for matching purposes it is sufficient to map the non-spacing only; for matching purposes it is sufficient to map the non-spacing
accent and exchange the order of the character pair as specified in accent and exchange the order of the character pair as specified in
Appendix A. Appendix A. This appendix is informative.
B.1. Combinations with SPACE B.1. Combinations with SPACE
Accents may be combined with a <SPACE> to generate the accent by Accents may be combined with a <SPACE> to generate the accent by
itself. For each accent code, the result of combining with <SPACE> is itself. For each accent code, the result of combining with <SPACE> is
listed in Table B.1. listed in Table B.1.
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
--+------+------+------+------+------+------+------+------+ --+------+------+------+------+------+------+------+------+
c0| ?? | 0060 | 00b4 | 005e | 007e | 00af | 02d8 | 02d9 | c0| ?? | 0060 | 00b4 | 005e | 007e | 00af | 02d8 | 02d9 |
 End of changes. 
