draft-ietf-idn-race-02.txt   draft-ietf-idn-race-03.txt 
Internet Draft Paul Hoffman Internet Draft Paul Hoffman
draft-ietf-idn-race-02.txt IMC & VPNC draft-ietf-idn-race-03.txt IMC & VPNC
October 16, 2000 November 22, 2000
Expires in six months Expires in six months
RACE: Row-based ASCII Compatible Encoding for IDN RACE: Row-based ASCII Compatible Encoding for IDN
Status of this memo Status of this memo
This document is an Internet-Draft and is in full conformance with all This document is an Internet-Draft and is in full conformance with all
provisions of Section 10 of RFC2026. provisions of Section 10 of RFC2026.
Internet-Drafts are working documents of the Internet Engineering Task Internet-Drafts are working documents of the Internet Engineering Task
skipping to change at line 62 skipping to change at line 62
or arch-3. RACE specifies an ACE format as specified in ace-1 in or arch-3. RACE specifies an ACE format as specified in ace-1 in
[IDNComp]. Further, it specifies an identifying mechanism for ace-2 in [IDNComp]. Further, it specifies an identifying mechanism for ace-2 in
[IDNComp], namely ace-2.1.1 (add hopefully-unique legal tag to the [IDNComp], namely ace-2.1.1 (add hopefully-unique legal tag to the
beginning of the name part). beginning of the name part).
Author's note: although earlier drafts of this document supported the Author's note: although earlier drafts of this document supported the
ideas in arch-3, I no longer support that idea and instead only support ideas in arch-3, I no longer support that idea and instead only support
arch-2. Of course, someone else might write an IDN proposal that matches arch-2. Of course, someone else might write an IDN proposal that matches
arch-3 and use RACE as the protocol. arch-3 and use RACE as the protocol.
In formal terms, RACE describes a character encoding scheme of the ISO In formal terms, RACE describes a character encoding scheme of the
10646 [ISO10646] coded character set and the rules for using that scheme ISO/IEC 10646 [ISO10646] coded character set (whose assignment of
in the DNS. As such, it could also be called a "charset" as defined in characters is synchronized with Unicode [Unicode3]) and the rules for
[IDNReq]. using that scheme in the DNS. As such, it could also be called a
"charset" as defined in [IDNReq].
The RACE protocol has the following features: The RACE protocol has the following features:
- There is exactly one way to convert internationalized host parts to - There is exactly one way to convert internationalized host parts to
and from RACE parts. Host name part uniqueness is preserved. and from RACE parts. Host name part uniqueness is preserved.
- Host parts that have no international characters are not changed. - Host parts that have no international characters are not changed.
- Names using RACE can include more internationalized characters than - Names using RACE can include more internationalized characters than
with other ACE protocols that have been suggested to date. Names in the with other ACE protocols that have been suggested to date. Names in the
skipping to change at line 170 skipping to change at line 171
illegal characters. illegal characters.
2.2 Converting an internationalized name to an ACE name part 2.2 Converting an internationalized name to an ACE name part
To convert a string of internationalized characters into an ACE name To convert a string of internationalized characters into an ACE name
part, the following steps MUST be performed in the exact order of the part, the following steps MUST be performed in the exact order of the
subsections given here. subsections given here.
If a name part consists exclusively of characters that conform to the If a name part consists exclusively of characters that conform to the
host name requirements in [STD13], the name MUST NOT be converted to host name requirements in [STD13], the name MUST NOT be converted to
RACE. That is, a name part that can be represented without RACE MUST NOT RACE. That is, a name part that can be represented without RACE MUST NOT
be encoded using RACE. This absolute requirement prevents there from be encoded using RACE. This absolute requirement prevents there from
being two different encodings for a single DNS host name. being two different encodings for a single DNS host name.
If any checking for prohibited name parts (such as ones that are If any checking for prohibited name parts (such as ones that are
prohibited characters, case-folding, or canonicalization) is to be done, prohibited characters, case-folding, or canonicalization) is to be done,
it MUST be done before doing the conversion to an ACE name part. it MUST be done before doing the conversion to an ACE name part.
Characters outside the first plane of characters (those with codepoints
above U+FFFF) MUST be represented using surrogates, as described in the
UTF-16 description in ISO 10646.
The input name string consists of characters from the ISO 10646 The input name string consists of characters from the ISO 10646
character set in big-endian UTF-16 encoding. This is the pre-converted character set in big-endian UTF-16 encoding. This is the pre-converted
string. string.
Characters outside the first plane of characters (that is, outside the 2.2.1 Check the input string for disallowed names
first 0xFFFF characters) MUST be represented using surrogates, as
described in the UTF-16 description in ISO 10646.
2.2.1 Compress the pre-converted string If the input string consists only of characters that conform to the host
name requirements in [STD13], the conversion MUST stop with an error.
2.2.2 Compress the pre-converted string
The entire pre-converted string MUST be compressed using the compression The entire pre-converted string MUST be compressed using the compression
algorithm specified in section 2.4. The result of this step is the algorithm specified in section 2.4. The result of this step is the
compressed string. compressed string.
2.2.2 Check the length of the compressed string 2.2.3 Check the length of the compressed string
The compressed string MUST be 36 octets or shorter. If the compressed The compressed string MUST be 36 octets or shorter. If the compressed
string is 37 octets or longer, the conversion MUST stop with an error. string is 37 octets or longer, the conversion MUST stop with an error.
2.2.3 Encode the compressed string with Base32 2.2.4 Encode the compressed string with Base32
The compressed string MUST be converted using the Base32 encoding The compressed string MUST be converted using the Base32 encoding
described in section 2.5. The result of this step is the encoded string. described in section 2.5. The result of this step is the encoded string.
2.2.4 Prepend "bq--" to the encoded string and finish 2.2.5 Prepend "bq--" to the encoded string and finish
Prepend the characters "bq--" to the encoded string. This is the host Prepend the characters "bq--" to the encoded string. This is the host
name part that can be used in DNS resolution. name part that can be used in DNS resolution.
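The order of the steps in the newer version of section 2.2 can be sketched as below. This is a minimal illustration, not a reference implementation: the function name `to_race_part` is hypothetical, the compression (section 2.4.1) and Base32 (section 2.5.1) routines are passed in as callables rather than defined here, and the STD13 check is assumed to mean "only ASCII letters, digits, and hyphen".

```python
# Characters assumed (for this sketch) to satisfy the STD13 host name rules.
LDH = set("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789-")


def to_race_part(name, compress, base32_encode):
    """Convert one internationalized name part to a RACE name part.

    `compress` and `base32_encode` are caller-supplied callables standing in
    for the algorithms of sections 2.4.1 and 2.5.1 (hypothetical parameters).
    """
    # 2.2.1: a name part that is already legal MUST NOT be encoded.
    if all(ch in LDH for ch in name):
        raise ValueError("name part needs no encoding; conversion must stop")

    # Pre-converted string: big-endian UTF-16, no byte order mark.
    # Python emits surrogate pairs for code points above U+FFFF.
    pre_converted = name.encode("utf-16-be")

    # 2.2.2: compress the entire pre-converted string.
    compressed = compress(pre_converted)

    # 2.2.3: the compressed string MUST be 36 octets or shorter.
    if len(compressed) > 36:
        raise ValueError("compressed string is longer than 36 octets")

    # 2.2.4 and 2.2.5: Base32-encode, then prepend the "bq--" tag.
    return "bq--" + base32_encode(compressed)
```

Passing the two algorithms as parameters keeps the sketch focused on the ordering and the two error exits; any prohibited-character checking, case-folding, or canonicalization is assumed to have happened before this function is called.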
2.3 Converting a host name part to an internationalized name 2.3 Converting a host name part to an internationalized name
The input string for conversion is a valid host name part. Note that if The input string for conversion is a valid host name part. Note that if
any checking for prohibited name parts (such as ones that are already any checking for prohibited name parts (such as prohibited characters,
legal DNS name parts), prohibited characters, case-folding, or case-folding, or canonicalization) is to be done, it MUST be done after
canonicalization is to be done, it MUST be done after doing the doing the conversion from an ACE name part.
conversion from an ACE name part. (Previous versions of this draft
specified these steps.) If a decoded name part consists exclusively of characters that conform
to the host name requirements in [STD13], the conversion from RACE MUST
fail. Because a name part that can be represented without RACE MUST NOT
be encoded using RACE, the decoding process MUST check for name parts
that consist exclusively of characters that conform to the host name
requirements in [STD13] and, if such a name part is found, MUST
treat it as an error (and possibly a security violation).
2.3.1 Strip the "bq--" 2.3.1 Strip the "bq--"
The input string MUST begin with the characters "bq--". If it does not, The input string MUST begin with the characters "bq--". If it does not,
the conversion MUST stop with an error. Otherwise, remove the characters the conversion MUST stop with an error. Otherwise, remove the characters
"bq--" from the input string. The result of this step is the stripped "bq--" from the input string. The result of this step is the stripped
string. string.
2.3.2 Decode the stripped string with Base32 2.3.2 Decode the stripped string with Base32
skipping to change at line 239 skipping to change at line 251
post-converted string. Otherwise, the entire resulting string MUST be post-converted string. Otherwise, the entire resulting string MUST be
converted to a binary format using the Base32 decoding described in converted to a binary format using the Base32 decoding described in
section 2.5. The result of this step is the decoded string. section 2.5. The result of this step is the decoded string.
2.3.3 Decompress the decoded string 2.3.3 Decompress the decoded string
The entire decoded string MUST be converted to ISO 10646 characters The entire decoded string MUST be converted to ISO 10646 characters
using the decompression algorithm described in section 2.4. The result using the decompression algorithm described in section 2.4. The result
of this is the internationalized string. of this is the internationalized string.
2.3.4 Check the internationalized string for disallowed names
If the internationalized string consists only of characters that conform
to the host name requirements in [STD13], the conversion MUST stop with
an error.
2.4 Compression algorithm 2.4 Compression algorithm
The basic method for compression is to reduce a full string that The basic method for compression is to reduce a full string that
consists of characters all from a single row of the ISO 10646 consists of characters all from a single row of the ISO 10646
repertoire, or all from a single row plus from row 0, to as few octets repertoire, or all from a single row plus from row 0, to as few octets
as possible. Any full string that has characters that come from two as possible. Any full string that has characters that come from two
rows, neither of which are row 0, or three or more rows, has all the rows, neither of which are row 0, or three or more rows, has all the
octets of the input string in the output string. octets of the input string in the output string.
If the string comes from only one row, compression is to one octet per If the string comes from only one row, compression is to one octet per
skipping to change at line 280 skipping to change at line 298
Note that the compression and decompression rules MUST be followed Note that the compression and decompression rules MUST be followed
exactly. This requirement prevents a single host name part from having exactly. This requirement prevents a single host name part from having
two encodings. Thus, for any input to the algorithm, there is only one two encodings. Thus, for any input to the algorithm, there is only one
possible output. An implementation cannot choose to use one-octet mode or possible output. An implementation cannot choose to use one-octet mode or
two-octet mode using anything other than the logic given in this two-octet mode using anything other than the logic given in this
section. section.
2.4.1 Compressing a string 2.4.1 Compressing a string
The input string is in UTF-16 encoding with no byte order mark. The input string is in big-endian UTF-16 encoding with no byte order
mark.
Design note: No checking is done on the input to this algorithm. It is Design note: No checking is done on the input to this algorithm. It is
assumed that all checking for valid ISO 10646 characters has already assumed that all checking for valid ISO/IEC 10646 characters has already
been done by a previous step in the conversion process. been done by a previous step in the conversion process.
Design note: In step 5, 0xFF was chosen as the escape character because Design note: In step 5, 0xFF was chosen as the escape character because
it appears in the fewest number of scripts in ISO 10646, and therefore it appears in the fewest number of scripts in ISO 10646, and therefore
the "escaped escape" will be needed the least. 0x99 was chosen as the the "escaped escape" will be needed the least. 0x99 was chosen as the
second octet for the "escaped escape" because the character U+0099 has second octet for the "escaped escape" because the character U+0099 has
no value, and is not even used as a control character in the C1 controls no value, and is not even used as a control character in the C1 controls
or in ISO 6429. or in ISO 6429.
1) Read each pair of octets in the input stream, comparing the upper 1) Starting at the beginning of the input, read each pair of octets in
octet of each. If all of the upper octets (called U1) are the same, go the input stream, comparing the upper octet of each. Reset the input
to step 4. pointer to the beginning of the input again. If all of the upper octets
(called U1) are the same, go to step 4. Note that if the input is only
one character, this test will always be true.
2) Read each pair of octets in the input stream, comparing the upper 2) Read each pair of octets in the input stream, comparing the upper
octet of each. If all of the upper octets are either 0 or one single octet of each. Reset the input pointer to the beginning of the input
other value (called U1), go to step 4. again. If all of the upper octets are either 0x00 or one single other
value (called U1), go to step 4.
3) Output 0xD8, followed by the entire input stream. Finish. 3) Output 0xD8, followed by the entire input stream. Finish.
4) Output U1. 4) If U1 is in the range 0xD8 to 0xDC, stop with an error. Otherwise,
output U1.
5) If you are at the end of the input string, finish. Otherwise, read 5) If you are at the end of the input string, finish. Otherwise, read
the next octet, called U2, and the octet after that, called N1. the next octet, called U2, and the octet after that, called N1. If U2 is
0x00 and N1 is 0x99, stop with an error.
6) If U2 is equal to U1, and N1 is not equal to 0xFF, output N1, and go 6) If U2 is equal to U1, and N1 is not equal to 0xFF, output N1, and go
to step 5. to step 5.
7) If U2 is equal to U1, and N1 is equal to 0xFF, output 0xFF followed 7) If U2 is equal to U1, and N1 is equal to 0xFF, output 0xFF followed
by 0x99, and go to step 5. by 0x99, and go to step 5.
8) Output 0xFF followed by N1. Go to step 5. 8) Output 0xFF followed by N1. Go to step 5.
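The compression steps in the newer column above can be sketched in Python as follows. This is only an illustration under stated assumptions: the name `race_compress` is hypothetical, and rejecting empty or odd-length input is an added guard that the steps themselves do not spell out.

```python
def race_compress(utf16be: bytes) -> bytes:
    """Compress big-endian UTF-16 octets (no BOM) per steps 1 to 8."""
    # Assumption of this sketch: empty or odd-length input is malformed.
    if len(utf16be) == 0 or len(utf16be) % 2 != 0:
        raise ValueError("input must be a non-empty sequence of octet pairs")

    uppers = set(utf16be[0::2])      # upper octet of every character
    nonzero = uppers - {0x00}

    if len(uppers) == 1:             # step 1: a single row
        u1 = next(iter(uppers))
    elif len(nonzero) == 1:          # step 2: row 0 plus one other row
        u1 = next(iter(nonzero))
    else:                            # step 3: no compression possible
        return b"\xd8" + utf16be

    # Step 4: these rows would collide with the 0xD8 marker and surrogates.
    if 0xD8 <= u1 <= 0xDC:
        raise ValueError("row 0xD8..0xDC cannot be used as U1")
    out = bytearray([u1])

    for i in range(0, len(utf16be), 2):
        u2, n1 = utf16be[i], utf16be[i + 1]
        # Step 5: U+0099 would be confused with the "escaped escape".
        if u2 == 0x00 and n1 == 0x99:
            raise ValueError("U+0099 is not allowed in the input")
        if u2 == u1 and n1 != 0xFF:  # step 6
            out.append(n1)
        elif u2 == u1:               # step 7: escape a literal 0xFF
            out += b"\xff\x99"
        else:                        # step 8: character from row 0
            out += bytes([0xFF, n1])
    return bytes(out)
```

For instance, `race_compress("\u012d\u00e0\u014b".encode("utf-16-be"))` mixes row 0x01 with row 0x00, so the second character is emitted behind the 0xFF escape.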
2.4.2 Decompressing a string 2.4.2 Decompressing a string
1) Read the first octet of the input string. Call the value of the first 1) Read the first octet of the input string. Call the value of the first
octet U1. If U1 is 0xD8, go to step 7. octet U1. If there are no more octets in the input string (that is, if
the input string had only one octet total), stop with an error. If U1 is
0xD8, go to step 8.
2) If you are at the end of the input string, finish. Otherwise, read 2) If you are at the end of the input string, go to step 11. Otherwise,
the next octet in the input string, called N1. If N1 is 0xFF, go to step read the next octet in the input string, called N1. If N1 is 0xFF, go to
4. step 5.
3) Output U1 followed by N1. Go to step 2. 3) If U1 is 0x00 and N1 is 0x99, stop with an error.
4) If you are at the end of the input string, stop with an error. 4) Put U1 followed by N1 in the output buffer. Go to step 2.
5) Read the next octet of the input string, called N1. If N1 is 0x99, 5) If you are at the end of the input string, stop with an error.
output U1 followed by 0xFF, and go to step 2.
6) Output 0x00 followed by N1. Go to step 2. 6) Read the next octet of the input string, called N1. If N1 is 0x99,
put U1 followed by 0xFF in the output buffer, and go to step 2.
7) Read the rest of the input stream and put it in the output stream. 7) Put 0x00 followed by N1 in the output buffer. Go to step 2.
Finish.
8) Read the rest of the input stream into a temporary string called
LCHECK. If the length of LCHECK is an odd number, stop with an error.
9) Perform the checks from steps 1 and 2 of the compression algorithm in
section 2.4.1 on LCHECK. If either check passes (that is, if either would
have created a compressed string), stop with an error because the input
to the decompression is in the wrong format.
10) If the length of LCHECK is odd, stop with an error. Otherwise,
output LCHECK and finish.
11) If the length of the output buffer is odd, stop with an error.
Otherwise, emit the output buffer and finish.
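The newer decompression steps (1 through 11) can be sketched the same way. The name `race_decompress` is hypothetical, and the sketch follows the listed steps literally without adding checks that the text does not state.

```python
def race_decompress(data: bytes) -> bytes:
    """Expand a compressed string back to big-endian UTF-16 octets."""
    # Step 1: there must be at least one octet after U1.
    if len(data) < 2:
        raise ValueError("compressed string is too short")
    u1 = data[0]

    if u1 == 0xD8:
        # Steps 8 to 10: the rest is uncompressed UTF-16.
        lcheck = data[1:]
        if len(lcheck) % 2 != 0:
            raise ValueError("odd number of octets after 0xD8")
        uppers = set(lcheck[0::2])
        # Step 9: if the compressor would have compressed this input,
        # the 0xD8 form is not the unique encoding and must be rejected.
        if len(uppers) == 1 or len(uppers - {0x00}) == 1:
            raise ValueError("input should have been compressed")
        return lcheck

    out = bytearray()
    i = 1
    while i < len(data):             # step 2
        n1 = data[i]
        i += 1
        if n1 != 0xFF:
            if u1 == 0x00 and n1 == 0x99:   # step 3
                raise ValueError("U+0099 is not allowed")
            out += bytes([u1, n1])          # step 4
            continue
        if i >= len(data):           # step 5: dangling escape octet
            raise ValueError("input ends after escape octet 0xFF")
        n1 = data[i]                 # step 6
        i += 1
        if n1 == 0x99:
            out += bytes([u1, 0xFF])        # escaped escape: literal 0xFF
        else:
            out += bytes([0x00, n1])        # step 7: character from row 0
    # Step 11: pairs are always appended, so the length is even here.
    return bytes(out)
```

The check in step 9 is what keeps the mapping one-to-one: a string that could have been written in one-octet mode is refused when it arrives in 0xD8 form.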
2.4.3 Compression examples 2.4.3 Compression examples
For the input string of <U+012E><U+0110><U+014A>, all characters are in For the input string of <U+012D><U+0111><U+014B>, all characters are in
the same row, 0x01. Thus, the output is 0x012E104A. the same row, 0x01. Thus, the output is 0x012D114B.
For the input string of <U+012E><U+00D0><U+014A>, the characters are all For the input string of <U+012D><U+00E0><U+014B>, the characters are all
in row 0x01 or row 0x00. Thus, the output is 0x012EFFD04A. in row 0x01 or row 0x00. Thus, the output is 0x012DFFE04B.
For the input string of <U+1290><U+12FF><U+120C>, the characters are all For the input string of <U+1290><U+12FF><U+120C>, the characters are all
in row 0x12. Thus, the output is 0x1290FF990C. in row 0x12. Thus, the output is 0x1290FF990C.
For the input string of <U+012E><U+00D0><U+24C3>, the characters are For the input string of <U+012D><U+00E0><U+24D3>, the characters are
from two rows other than 0x00. Thus, the output is 0xD8012E00D024C3. from two rows other than 0x00. Thus, the output is 0xD8012D00E024D3.
2.5 Base32 2.5 Base32
In order to encode non-ASCII characters in DNS-compatible host name parts, In order to encode non-ASCII characters in DNS-compatible host name parts,
they must be converted into legal characters. This is done with Base32 they must be converted into legal characters. This is done with Base32
encoding, described here. encoding, described here.
Table 1 shows the mapping between input bits and output characters in Table 1 shows the mapping between input bits and output characters in
Base32. Design note: the digits used in Base32 are "2" through "7" Base32. Design note: the digits used in Base32 are "2" through "7"
instead of "0" through "6" in order to avoid digits "0" and "1". This instead of "0" through "6" in order to avoid digits "0" and "1". This
skipping to change at line 392 skipping to change at line 431
2.5.1 Encoding octets as Base32 2.5.1 Encoding octets as Base32
The input is a stream of octets. However, the octets are then treated The input is a stream of octets. However, the octets are then treated
as a stream of bits. as a stream of bits.
Design note: The assumption that the input is a stream of octets Design note: The assumption that the input is a stream of octets
(instead of a stream of bits) was made so that no padding was needed. (instead of a stream of bits) was made so that no padding was needed.
If you are reusing this algorithm for a stream of bits, you must add a If you are reusing this algorithm for a stream of bits, you must add a
padding mechanism in order to differentiate different lengths of input. padding mechanism in order to differentiate different lengths of input.
1) Set the read pointer to the beginning of the input bit stream. 1) If the input bit stream is not an even multiple of five bits, pad
the input stream with 0 bits until it is an even multiple of five bits.
Set the read pointer to the beginning of the input bit stream.
2) Look at the five bits after the read pointer. If there are not five 2) Look at the five bits after the read pointer.
bits, go to step 5.
3) Look up the value of the set of five bits in the bits column of 3) Look up the value of the set of five bits in the bits column of
Table 1, and output the character from the char column (whose hex value Table 1, and output the character from the char column (whose hex value
is in the hex column). is in the hex column).
4) Move the read pointer five bits forward. If the read pointer is at 4) Move the read pointer five bits forward. If the read pointer is at
the end of the input bit stream (that is, there are no more bits in the the end of the input bit stream (that is, there are no more bits in the
input), stop. Otherwise, go to step 2. input), stop. Otherwise, go to step 2.
5) Pad the bits seen until there are five bits.
6) Look up the value of the set of five bits in the bits column of
Table 1, and output the character from the char column (whose hex value
is in the hex column).
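The newer, shorter form of the encoding steps (pad with 0 bits first, then map each five-bit group) can be sketched as below. Table 1 is not reproduced in this excerpt, so the alphabet used here, "a" through "z" followed by "2" through "7", is an assumption based on the design note about digits; the name `base32_encode` is likewise hypothetical.

```python
# Assumed contents of Table 1: values 0..25 map to "a".."z", 26..31 to "2".."7".
ALPHABET = "abcdefghijklmnopqrstuvwxyz234567"


def base32_encode(octets: bytes) -> str:
    """Encode octets as Base32 characters, five bits per character."""
    # Treat the octets as one stream of bits.
    bits = "".join(f"{octet:08b}" for octet in octets)
    # Step 1: pad with 0 bits up to a multiple of five bits.
    bits += "0" * (-len(bits) % 5)
    # Steps 2 to 4: look up each group of five bits in the table.
    return "".join(
        ALPHABET[int(bits[i:i + 5], 2)] for i in range(0, len(bits), 5)
    )
```

Under that alphabet assumption, the four octets 0x3a270f93 from the example in section 2.5.3 produce seven characters, the last of which carries three padding bits.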
2.5.2 Decoding Base32 as octets 2.5.2 Decoding Base32 as octets
The input is octets in network byte order. The input octets MUST be The input is octets in network byte order. The input octets MUST be
values from the second column in Table 1. values from the second column in Table 1.
1) Set the read pointer to the beginning of the input octet stream. 1) Count the number of octets in the input and divide it by 8; call the
remainder INPUTCHECK. If INPUTCHECK is 1 or 3 or 6, stop with an error.
2) Look up the character value of the octet in the char column (or hex 2) Set the read pointer to the beginning of the input octet stream.
value in hex column) of Table 1, and output the five bits from the bits
column.
3) Move the read pointer one octet forward. If the read pointer is at 3) Look up the character value of the octet in the char column (or hex
the end of the input octet stream (that is, there are no more octets in value in hex column) of Table 1, and add the five bits from the bits
the input), stop. Otherwise, go to step 2. column to the output buffer.
4) Move the read pointer one octet forward. If the read pointer is not
at the end of the input octet stream (that is, there are more octets in
the input), go to step 3.
5) Count the number of bits that are in the output buffer and divide it
by 8; call the remainder PADDING. If the PADDING number of bits at the
end of the output buffer are not all zero, stop with an error.
Otherwise, emit the output buffer and stop.
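The newer decoding steps, including the INPUTCHECK and PADDING tests, can be sketched as follows. As with the encoder, the alphabet is an assumption because Table 1 is not shown in this excerpt, and `base32_decode` is a hypothetical name. The sketch accepts only lowercase input; whether uppercase is tolerated is left to the caller.

```python
# Assumed contents of Table 1: values 0..25 map to "a".."z", 26..31 to "2".."7".
ALPHABET = "abcdefghijklmnopqrstuvwxyz234567"


def base32_decode(text: str) -> bytes:
    """Decode Base32 characters back to octets, rejecting malformed input."""
    # Step 1: these lengths can never result from encoding whole octets.
    if len(text) % 8 in (1, 3, 6):
        raise ValueError("impossible Base32 length")

    # Steps 2 to 4: collect five bits per input character.
    bits = ""
    for ch in text:
        value = ALPHABET.find(ch)
        if value < 0:
            raise ValueError(f"character {ch!r} is not in the Base32 table")
        bits += f"{value:05b}"

    # Step 5: leftover bits are padding and must all be zero.
    padding = len(bits) % 8
    if padding and "1" in bits[-padding:]:
        raise ValueError("non-zero padding bits")

    usable = len(bits) - padding
    return bytes(int(bits[i:i + 8], 2) for i in range(0, usable, 8))
```

Rejecting non-zero padding bits matters for uniqueness: without it, several different character strings would decode to the same octets.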
2.5.3 Base32 example 2.5.3 Base32 example
Assume you want to encode the value 0x3a270f93. The bit string is: Assume you want to encode the value 0x3a270f93. The bit string is:
3 a 2 7 0 f 9 3 3 a 2 7 0 f 9 3
00111010 00100111 00001111 10010011 00111010 00100111 00001111 10010011
Broken into chunks of five bits, this is: Broken into chunks of five bits, this is:
skipping to change at line 493 skipping to change at line 535
[STD13] Paul Mockapetris, "Domain names - implementation and [STD13] Paul Mockapetris, "Domain names - implementation and
specification", November 1987, STD 13 (RFC 1035). specification", November 1987, STD 13 (RFC 1035).
[Unicode3] The Unicode Consortium, "The Unicode Standard -- Version [Unicode3] The Unicode Consortium, "The Unicode Standard -- Version
3.0", ISBN 0-201-61633-5. Described at 3.0", ISBN 0-201-61633-5. Described at
<http://www.unicode.org/unicode/standard/versions/Unicode3.0.html>. <http://www.unicode.org/unicode/standard/versions/Unicode3.0.html>.
A. Acknowledgements A. Acknowledgements
Mark Davis contributed many ideas to the initial draft of this document. Mark Davis contributed many ideas to the initial draft of this document,
Graham Klyne and Martin Duerst offered technical comments on the as well as comments in later versions. Graham Klyne and Martin Duerst
algorithms used. GIM Gyeongseog and Pongtorn Jentaweepornkul helped fix offered technical comments on the algorithms used. GIM Gyeongseog and
technical errors in early drafts. Pongtorn Jentaweepornkul helped fix technical errors in early drafts.
Rick Wesson and Mark Davis contributed many suggestions on error
conditions in the processing.
Base32 is quite obviously inspired by the tried-and-true Base64 Base32 is quite obviously inspired by the tried-and-true Base64
Content-Transfer-Encoding from MIME. Content-Transfer-Encoding from MIME.
B. Changes from Versions -01 to -02 of this Draft B. Changes from Versions -02 to -03 of this Draft
Removed section 1.3 (open issues) because no one said anything 1: Wording corrections to third paragraph.
in support of either proposal.
Added the prohibition on encoding a string that is already in 2.2 and 2.3: Added need to check for all-STD13.
STD13 format in section 2.2.
2.4.1: Wording corrections in the first two paragraphs. Made step 1 and
2 clearer with resetting the input pointer. Also added sentence at the
end of step 1. Also added error conditions in steps 4 and 5.
2.4.2: Added error condition in step 1. Added a new step 3 for an error
check. Expanded step 8 to check for malformed input error. Added error
check for odd-length output.
2.4.3: Changed all the examples to use lowercase characters on input.
2.5.1: Made the list of steps shorter by padding with 0 bits at the
beginning of the steps.
2.5.2: Changed the sense of the test in step 3 and added step 4 to
check for malformed input. Also made the output a buffer. Also added
new step 1.
C. IANA Considerations C. IANA Considerations
There are no IANA considerations in this document. There are no IANA considerations in this document.
D. Author Contact Information D. Author Contact Information
Paul Hoffman Paul Hoffman
Internet Mail Consortium and VPN Consortium Internet Mail Consortium and VPN Consortium
127 Segre Place 127 Segre Place
 End of changes. 37 change blocks. 
71 lines changed or deleted 130 lines changed or added
