draft-ietf-idn-race-00.txt   draft-ietf-idn-race-01.txt 
Internet Draft Paul Hoffman Internet Draft Paul Hoffman
draft-ietf-idn-race-00.txt IMC & VPNC draft-ietf-idn-race-01.txt IMC & VPNC
June 18, 2000 August 31, 2000
Expires in six months Expires in six months
RACE: Row-based ASCII Compatible Encoding for IDN RACE: Row-based ASCII Compatible Encoding for IDN
Status of this memo Status of this memo
This document is an Internet-Draft and is in full conformance with all This document is an Internet-Draft and is in full conformance with all
provisions of Section 10 of RFC2026. provisions of Section 10 of RFC2026.
Internet-Drafts are working documents of the Internet Engineering Task Internet-Drafts are working documents of the Internet Engineering Task
Force (IETF), its areas, and its working groups. Note that other Force (IETF), its areas, and its working groups. Note that other
groups may also distribute working documents as Internet-Drafts. groups may also distribute working documents as Internet-Drafts.
Internet-Drafts are draft documents valid for a maximum of six months Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress." material or to cite them other than as "work in progress."
To view the list Internet-Draft Shadow Directories, see The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt
The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html. http://www.ietf.org/shadow.html.
Abstract Abstract
This document describes a transformation method for representing This document describes a transformation method for representing
non-ASCII characters in host name parts in a fashion that is completely non-ASCII characters in host name parts in a fashion that is completely
compatible with the current DNS. It is a potential candidate for an compatible with the current DNS. It is a potential candidate for an
ASCII-Compatible Encoding (ACE) for internationalized host names, as ASCII-Compatible Encoding (ACE) for internationalized host names, as
described in the comparison document from the IETF IDN Working Group. described in the comparison document from the IETF IDN Working Group.
This method is based on the observation that many internationalized This method is based on the observation that many internationalized
skipping to change at line 73 skipping to change at line 76
The RACE protocol has the following features: The RACE protocol has the following features:
- There is exactly one way to convert internationalized host parts to - There is exactly one way to convert internationalized host parts to
and from RACE parts. Host name part uniqueness is preserved. and from RACE parts. Host name part uniqueness is preserved.
- Host parts that have no international characters are not changed. - Host parts that have no international characters are not changed.
- Names using RACE can include more internationalized characters than - Names using RACE can include more internationalized characters than
with other ACE protocols that have been suggested to date. Names in the with other ACE protocols that have been suggested to date. Names in the
Han, Yi, Hangul syllables, or Ethiopic scripts can have up to 18 Han, Yi, Hangul syllables, or Ethiopic scripts can have up to 17
characters, and names in most other scripts can have up to 36 characters, and names in most other scripts can have up to 35
characters. Further, a name that consists of characters from one characters. Further, a name that consists of characters from one
non-Latin script but also contains some Latin characters such as digits non-Latin script but also contains some Latin characters such as digits
or hyphens can have close to 36 characters. or hyphens can have close to 33 characters.
It is important to note that the following sections contain many It is important to note that the following sections contain many
normative statements with "MUST" and "MUST NOT". Any implementation that normative statements with "MUST" and "MUST NOT". Any implementation that
does not follow these statements exactly is likely to cause damage to does not follow these statements exactly is likely to cause damage to
the Internet by creating non-unique representations of host names. the Internet by creating non-unique representations of host names.
1.1 Terminology 1.1 Terminology
The key words "MUST", "SHALL", "REQUIRED", "SHOULD", "RECOMMENDED", and The key words "MUST", "SHALL", "REQUIRED", "SHOULD", "RECOMMENDED", and
"MAY" in this document are to be interpreted as described in RFC 2119 "MAY" in this document are to be interpreted as described in RFC 2119
skipping to change at line 114 skipping to change at line 117
1.2 IDN summary 1.2 IDN summary
Using the terminology in [IDNComp], RACE specifies an ACE format as Using the terminology in [IDNComp], RACE specifies an ACE format as
specified in ace-1. Further, it specifies an identifying mechanism for specified in ace-1. Further, it specifies an identifying mechanism for
ace-2, namely ace-2.1.1 (add hopefully-unique legal tag to the beginning ace-2, namely ace-2.1.1 (add hopefully-unique legal tag to the beginning
of the name part). of the name part).
RACE has the following length characteristics. In this list, "row" means RACE has the following length characteristics. In this list, "row" means
a row from ISO 10646. a row from ISO 10646.
- If the characters in the input all come from the same row, up to 36 - If the characters in the input all come from the same row, up to 35
characters per name part are allowed. characters per name part are allowed.
- If the characters in the input come from two or more rows, neither of - If the characters in the input come from two or more rows, neither of
which is row 0, up to 18 characters per name part are allowed. which is row 0, up to 17 characters per name part are allowed.
- If the characters in the input come from two rows, one of which is row - If the characters in the input come from two rows, one of which is row
0, between 18 and 35 characters per name part are allowed. 0, between 17 and 33 characters per name part are allowed.
1.3 Open issues 1.3 Open issues
Is it OK in 2.3.2 to say "0 MAY be converted to O and that 1 MAY be Is it OK in 2.3.2 to say "0 MAY be converted to O and that 1 MAY be
converted to l"? converted to l"?
Do we want to leave the unused characters 0, 1, 8, and 9 "reserved" in Do we want to leave the unused characters 0, 1, 8, and 9 "reserved" in
Base32 instead of making them prohibited now? This allows creative Base32 instead of making them prohibited now? This allows creative
expansion in the future. expansion in the future.
skipping to change at line 144 skipping to change at line 147
According to [STD13], host parts must be case-insensitive, start and According to [STD13], host parts must be case-insensitive, start and
end with a letter or digit, and contain only letters, digits, and the end with a letter or digit, and contain only letters, digits, and the
hyphen character ("-"). This, of course, excludes any internationalized hyphen character ("-"). This, of course, excludes any internationalized
characters, as well as many other characters in the ASCII character characters, as well as many other characters in the ASCII character
repertoire. Further, domain name parts must be 63 octets or shorter in repertoire. Further, domain name parts must be 63 octets or shorter in
length. length.
2.1 Name tagging 2.1 Name tagging
All post-converted name parts that contain internationalized characters All post-converted name parts that contain internationalized characters
begin with the string "ra--". (Of course, because host name parts are begin with the string "bq--". (Of course, because host name parts are
case-insensitive, this might also be represented as "Ra--" or "rA--" or case-insensitive, this might also be represented as "Bq--" or "bQ--" or
"RA--".) The string "ra--" was chosen because it is extremely unlikely "BQ--".) The string "bq--" was chosen because it is extremely unlikely
to exist in host parts before this specification was produced. As a to exist in host parts before this specification was produced. As a
historical note, in mid-April 2000, none of the second-level host name historical note, in late August 2000, none of the second-level host name
parts in any of the .com, .edu, .net, and .org top-level domains began parts in any of the .com, .edu, .net, and .org top-level domains began
with "ra--"; there are about 36,000 other strings of three characters with "bq--"; there are many tens of thousands of other strings of three
followed by a hyphen that have this property and could be used instead. characters followed by a hyphen that have this property and could be
used instead. The string "bq--" will change to other strings with the
same properties in future versions of this draft.
Note that a zone administrator might still choose to use "ra--" at the Note that a zone administrator might still choose to use "bq--" at the
beginning of a host name part even if that part does not contain beginning of a host name part even if that part does not contain
internationalized characters. Zone administrators SHOULD NOT create host internationalized characters. Zone administrators SHOULD NOT create host
part names that begin with "ra--" unless those names are post-converted part names that begin with "bq--" unless those names are post-converted
names. Creating host part names that begin with "ra--" but that are not names. Creating host part names that begin with "bq--" but that are not
post-converted names may cause two distinct problems. Some display post-converted names may cause two distinct problems. Some display
systems, after converting the post-converted name part back to an systems, after converting the post-converted name part back to an
internationalized name part, might display the name parts in a internationalized name part, might display the name parts in a
possibly-confusing fashion to users. More seriously, some resolvers, possibly-confusing fashion to users. More seriously, some resolvers,
after converting the post-converted name part back to an after converting the post-converted name part back to an
internationalized name part, might reject the host name if it contains internationalized name part, might reject the host name if it contains
illegal characters. illegal characters.
2.2 Converting an internationalized name to an ACE name part 2.2 Converting an internationalized name to an ACE name part
skipping to change at line 202 skipping to change at line 207
2.2.2 Check the length of the compressed string 2.2.2 Check the length of the compressed string
The compressed string MUST be 36 octets or shorter. If the compressed The compressed string MUST be 36 octets or shorter. If the compressed
string is 37 octets or longer, the conversion MUST stop with an error. string is 37 octets or longer, the conversion MUST stop with an error.
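The 36-octet limit can be checked against the 63-octet name part limit cited from [STD13]. The following is only an illustrative sketch: it assumes that Base32 emits one character per 5 bits with no padding (section 2.5 is not reproduced in this diff), it uses the "bq--" tag from the -01 text, and the names TAG, base32_len, and race_label_len are hypothetical.

```python
TAG = "bq--"            # tag from section 2.1 of the -01 text
MAX_LABEL_OCTETS = 63   # name part limit cited from [STD13]

def base32_len(n_octets: int) -> int:
    """Characters needed for n_octets, assuming 5 bits per character and no padding."""
    return (n_octets * 8 + 4) // 5  # integer ceiling of (bits / 5)

def race_label_len(n_compressed_octets: int) -> int:
    """Length of the tagged name part for a compressed string of the given size."""
    return len(TAG) + base32_len(n_compressed_octets)

def fits_in_label(n_compressed_octets: int) -> bool:
    return race_label_len(n_compressed_octets) <= MAX_LABEL_OCTETS
```

Under these assumptions, 36 compressed octets yield a 62-character name part, while 37 octets would yield 64 and exceed the limit. The character counts in the -01 text follow the same budget: one header octet plus 35 one-octet characters is 36 octets, and one header octet plus 17 two-octet characters is 35 octets.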
2.2.3 Encode the compressed string with Base32 2.2.3 Encode the compressed string with Base32
The compressed string MUST be converted using the Base32 encoding The compressed string MUST be converted using the Base32 encoding
described in section 2.5. The result of this step is the encoded string. described in section 2.5. The result of this step is the encoded string.
2.2.4 Prepend "ra--" to the encoded string and finish 2.2.4 Prepend "bq--" to the encoded string and finish
Prepend the characters "ra--" to the encoded string. This is the host Prepend the characters "bq--" to the encoded string. This is the host
name part that can be used in DNS resolution. name part that can be used in DNS resolution.
2.3 Converting a host name part to an internationalized name 2.3 Converting a host name part to an internationalized name
The input string for conversion is a valid host name part. Note that if The input string for conversion is a valid host name part. Note that if
any checking for prohibited name parts (such as ones that are already any checking for prohibited name parts (such as ones that are already
legal DNS name parts), prohibited characters, case-folding, or legal DNS name parts), prohibited characters, case-folding, or
canonicalization is to be done, it MUST be done after doing the canonicalization is to be done, it MUST be done after doing the
conversion from an ACE name part. (Previous versions of this draft conversion from an ACE name part. (Previous versions of this draft
specified these steps.) specified these steps.)
2.3.1 Strip the "ra--" 2.3.1 Strip the "bq--"
The input string MUST begin with the characters "ra--". If it does not, The input string MUST begin with the characters "bq--". If it does not,
the conversion MUST stop with an error. Otherwise, remove the characters the conversion MUST stop with an error. Otherwise, remove the characters
"ra--" from the input string. The result of this step is the stripped "bq--" from the input string. The result of this step is the stripped
string. string.
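The stripping step might be sketched as follows. This is not a definitive implementation: the function name strip_tag is hypothetical, the "bq--" tag is the one from the -01 text, and the case-insensitive comparison is an assumption based on the remark in section 2.1 that the tag might also appear as "Bq--", "bQ--", or "BQ--".

```python
TAG = "bq--"  # tag from the -01 text; the -00 text used "ra--"

def strip_tag(name_part: str) -> str:
    """Return the stripped string, or raise if the tag is missing."""
    # Assumption: the tag is compared without regard to case (see section 2.1).
    if name_part[:len(TAG)].lower() != TAG:
        raise ValueError("input does not begin with the RACE tag")
    return name_part[len(TAG):]
```

The case of the remainder is left untouched here, because lower-casing is described as part of the Base32 decoding step in 2.3.2.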
2.3.2 Decode the stripped string with Base32 2.3.2 Decode the stripped string with Base32
The entire stripped string MUST be checked to see if it is valid Base32 The entire stripped string MUST be checked to see if it is valid Base32
output. The entire stripped string MUST be changed to all lower-case output. The entire stripped string MUST be changed to all lower-case
letters and digits. If any resulting characters are not in Table 1, the letters and digits. If any resulting characters are not in Table 1, the
conversion MUST stop with an error; the input string is the conversion MUST stop with an error; the input string is the
post-converted string. Otherwise, the entire resulting string MUST be post-converted string. Otherwise, the entire resulting string MUST be
converted to a binary format using the Base32 decoding described in converted to a binary format using the Base32 decoding described in
skipping to change at line 267 skipping to change at line 272
characters. If the string comes from only one row other than row 0, but characters. If the string comes from only one row other than row 0, but
also has characters only from row 0, the header octet is the upper octet also has characters only from row 0, the header octet is the upper octet
of the characters from the non-0 row. Otherwise, the header octet is of the characters from the non-0 row. Otherwise, the header octet is
0xD8, which is the upper octet of a surrogate pair. Design note: It is 0xD8, which is the upper octet of a surrogate pair. Design note: It is
impossible to have a legal stream of UTF-16 characters that has all the impossible to have a legal stream of UTF-16 characters that has all the
upper octets being 0xD8 because a character whose upper octet is 0xD8 upper octets being 0xD8 because a character whose upper octet is 0xD8
must be followed by one whose upper octet is in the range 0xDC through must be followed by one whose upper octet is in the range 0xDC through
0xDF. 0xDF.
Although the two-octet mode limits the number of characters in a RACE Although the two-octet mode limits the number of characters in a RACE
name part to 18, this is still generally enough for almost all names in name part to 17, this is still generally enough for almost all names in
almost all scripts. Also, this limit is close to the limits set by other almost all scripts. Also, this limit is close to the limits set by other
encoding proposals. encoding proposals.
Note that the compression and decompression rules MUST be followed Note that the compression and decompression rules MUST be followed
exactly. This requirement prevents a exactly. This requirement prevents a single host name part from having
single host name part from having two encodings. Thus, for any input two encodings. Thus, for any input to the algorithm, there is only one
to the algorithm, there is only one possible output. An implementation possible output. An implementation cannot choose to use one-octet mode or
cannot choose to use one-octet mode or two-octet mode using anything two-octet mode using anything other than the logic given in this
other than the logic given in this section. section.
2.4.1 Compressing a string 2.4.1 Compressing a string
The input string is in UTF-16 encoding with no byte order mark.
Design note: No checking is done on the input to this algorithm. It is Design note: No checking is done on the input to this algorithm. It is
assumed that all checking for valid ISO 10646 characters has already assumed that all checking for valid ISO 10646 characters has already
been done by a previous step in the conversion process. been done by a previous step in the conversion process.
Design note: In step 5, 0xFF was chosen as the escape character because Design note: In step 5, 0xFF was chosen as the escape character because
it appears in the fewest number of scripts in ISO 10646, and therefore it appears in the fewest number of scripts in ISO 10646, and therefore
the "escaped escape" will be needed the least. 0x99 was chosen as the the "escaped escape" will be needed the least. 0x99 was chosen as the
second octet for the "escaped escape" because the character U+0099 has second octet for the "escaped escape" because the character U+0099 has
no value, and is not even used as a control character in the C1 controls no value, and is not even used as a control character in the C1 controls
or in ISO 6429. or in ISO 6429.
1) Read each character in the input stream, comparing the upper octet of 1) Read each pair of octets in the input stream, comparing the upper
each. If all of the upper octets (called U1) are the same, go to step 4. octet of each. If all of the upper octets (called U1) are the same, go
to step 4.
2) Read each character in the input stream, comparing the upper octet of 2) Read each pair of octets in the input stream, comparing the upper
each. If all of the upper octets are either 0 or one single other value octet of each. If all of the upper octets are either 0 or one single
(called U1), go to step 5. other value (called U1), go to step 4.
3) Output 0xD8, followed by the entire input stream. Finish. 3) Output 0xD8, followed by the entire input stream. Finish.
4) Output U1. Output the lower octet of each character in the input. 4) Output U1.
Finish.
5) Output U1.
6) If you are at the end of the input string, finish. Otherwise, read 5) If you are at the end of the input string, finish. Otherwise, read
the next octet, called U2, and the octet after that, called N1. the next octet, called U2, and the octet after that, called N1.
7) If U2 is equal to U1, and N1 is not equal to 0xFF, output N1, and go 6) If U2 is equal to U1, and N1 is not equal to 0xFF, output N1, and go
to step 6. to step 5.
8) If U2 is equal to U1, and N1 is equal to 0xFF, output 0xFF followed 7) If U2 is equal to U1, and N1 is equal to 0xFF, output 0xFF followed
by 0x99, and go to step 6. by 0x99, and go to step 5.
9) Output 0xFF followed by N1. Go to step 6. 8) Output 0xFF followed by N1. Go to step 5.
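The -01 steps (the right-hand column, steps 1 through 8) can be sketched roughly as below. This is a minimal illustration, not a reference implementation: the function name race_compress is hypothetical, the input is assumed to be big-endian UTF-16 octets with the upper octet of each pair first, and rejecting empty or odd-length input is an added assumption, since the text says no checking is done at this stage.

```python
def race_compress(utf16_be: bytes) -> bytes:
    """Compress big-endian UTF-16 octets following the -01 steps 1-8."""
    # Assumption: the caller has validated the input; these checks only
    # keep the sketch from misbehaving on malformed data.
    if not utf16_be or len(utf16_be) % 2 != 0:
        raise ValueError("expected a non-empty sequence of octet pairs")

    uppers = set(utf16_be[0::2])   # upper octet of every pair
    non_zero = uppers - {0}

    if len(uppers) == 1:
        # Step 1: every upper octet is the same value.
        u1 = next(iter(uppers))
    elif len(non_zero) == 1:
        # Step 2: upper octets are 0 or one single other value.
        u1 = next(iter(non_zero))
    else:
        # Step 3: two-octet mode, header 0xD8 plus the untouched input.
        return bytes([0xD8]) + utf16_be

    out = bytearray([u1])          # Step 4: output U1.
    for i in range(0, len(utf16_be), 2):   # Step 5: read U2 and N1.
        u2, n1 = utf16_be[i], utf16_be[i + 1]
        if u2 == u1 and n1 != 0xFF:
            out.append(n1)                 # Step 6
        elif u2 == u1:
            out += b"\xff\x99"             # Step 7: escaped escape
        else:
            out += bytes([0xFF, n1])       # Step 8: character outside row U1
    return bytes(out)
```

For example, under these assumptions the pairs for U+03B1 U+03B2 compress to 03 B1 B2, and U+03B1 followed by the digit "1" compresses to 03 B1 FF 31. The sketch makes no attempt to handle U+0099 specially; the design note above relies on that character not appearing in input.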
2.4.2 Decompressing a string 2.4.2 Decompressing a string
1) Read the first octet of the input string. Call the value of the first 1) Read the first octet of the input string. Call the value of the first
octet U1. If U1 is 0xD8, go to step 7. octet U1. If U1 is 0xD8, go to step 7.
2) If you are at the end of the input string, finish. Otherwise, read 2) If you are at the end of the input string, finish. Otherwise, read
the next octet in the input string, called N1. If N1 is 0xFF, go to step the next octet in the input string, called N1. If N1 is 0xFF, go to step
4. 4.
skipping to change at line 493 skipping to change at line 498
[STD13] Paul Mockapetris, "Domain names - implementation and [STD13] Paul Mockapetris, "Domain names - implementation and
specification", November 1987, STD 13 (RFC 1035). specification", November 1987, STD 13 (RFC 1035).
[Unicode3] The Unicode Consortium, "The Unicode Standard -- Version [Unicode3] The Unicode Consortium, "The Unicode Standard -- Version
3.0", ISBN 0-201-61633-5. Described at 3.0", ISBN 0-201-61633-5. Described at
<http://www.unicode.org/unicode/standard/versions/Unicode3.0.html>. <http://www.unicode.org/unicode/standard/versions/Unicode3.0.html>.
A. Acknowledgements A. Acknowledgements
Mark Davis contributed many ideas to the initial draft of this Mark Davis contributed many ideas to the initial draft of this document.
document. Graham Klyne and Martin Duerst offered technical comments on Graham Klyne and Martin Duerst offered technical comments on the
the algorithms used. algorithms used. GIM Gyeongseog and Pongtorn Jentaweepornkul helped fix
technical errors in early drafts.
Base32 is quite obviously inspired by the tried-and-true Base64 Base32 is quite obviously inspired by the tried-and-true Base64
Content-Transfer-Encoding from MIME. Content-Transfer-Encoding from MIME.
B. Changes from Previous Versions of this Draft B. Changes from Versions -00 to -01 of this Draft
B.1 Changed from -03 to -04
This version of the document is radically changed to make it just a
template for an ACE, not a potential full IDN protocol. I believe
that a combination protocol that uses both binary on the wire and
an ACE is a better solution than an ACE-only protocol.
Title: Changed completely.
Abstract: Reworded completely.
Throughout: changed "aq8" to "ra--".
Throughout: changed "domain name" to "host name" where appropriate
(which was almost everywhere).
1: Reworded the beginning to narrow the scope.
1.2: Added this section.
1.3: Added the "open issues" section.
2: Moved the first paragraph up to section 1.
2.1: Added discussion of rejection problems with improper name tagging.
2.2: Removed all pre-checking, and put this into the process that
calls RACE.
2.2.1: Removed.
2.2.2: Removed.
2.2.3: Removed. Renumbered 2.2.4 through 2.2.7 to 2.2.1 through
2.2.4.
2.2.5 (old): Changed the values to 36 to reflect the correct maximum.
2.2.6 (old): Shortened the first sentence.
2.3: Removed all post-checking, and put this into the process that
calls RACE.
2.3.1: Changed to make failure here an error.
2.3.2: Changed to make failure here an error.
2.3.4: Removed.
2.4: Changed the algorithm to be better optimized for strings
that come from one row plus row 0. This caused a change in almost
everything in 2.4, 2.4.1, and 2.4.2.
2.4.3: Added this section of examples.
2.5: Renumbered Table 2 to Table 1.
2.5.3: Added padding step to the example.
3: Removed entire section. Renumbered 4 (Security Considerations) to
3 and renumbered 5 (References) to 4.
5: Added [IDNComp]. Removed [Norm]. Removed [RFC2278] and [UnicodeData].
B.2 Changes from -02 to -03
Throughout: changed "wg4" to "aq8".
2.2: Updated the first design note to indicate that the table
will probably be moved to its own draft.
2.2.3: Changed reference for normalization from [UTR15] to [Norm].
5: Updated the reference for [IDNReq]. Removed [UTR15] and replaced
it with [Norm].
B.3 Changes from -01 to -02
Throughout: Changed "ph6" to "wg4".
2.1: Updated count of unused three-letter prefixes.
2.3: Removed all the error states and clarified that any error in
conversion means that the input string is the post-converted
string.
2.4: Radically changed the compression scheme; the previous one
was far too cumbersome.
2.5: Renumbered Table 3 to Table 2.
2.5.1: Changed the second paragraph (should have been done in
the change to -01 to remove padding).
3.2: Clarified the paragraph emphasizing the need for users to be able
to copy names even if they are not displayable.
5: Removed reference to [UTR6].
A: Added Martin Duerst. Removed reference to the compression
algorithm because it has changed.
B.4 Changes from -00 to -01
Throughout: Changed references to the character set from Unicode
to ISO 10646, even though they are equivalent. Also changed
references to the rules for surrogate pairs to ISO 10646.
1.1: Clarified last paragraph. Throughout: Changed "ra--" to "bq--".
2.2: Reworded the first design note to make excluding case stuff Throughout: Fixed minor typos.
more likely.
2.5: Removed the "8" padding in the Base32 algorithm because 1: Fixed the lengths allowed.
it was superfluous.
2.5.1: Removed "in network byte order" from the first sentence 1.3: Fixed the lengths allowed.
because it was redundant.
3.3: Made the first paragraph stronger. 2.1: Added note about changing the actual prefix in future versions of
the draft.
5: Added reference to ISO 10646. This still needs work. 2.4.1: Added first sentence. Changed steps that talked about characters
to instead use "pair of octets". Fixed problem with the steps which
caused bad output in some cases.
A: Added Graham Klyne. A: Added thanks to GIM Gyeongseog and Pongtorn Jentaweepornkul.
C. IANA Considerations C. IANA Considerations
There are no IANA considerations in this document. There are no IANA considerations in this document.
D. Author Contact Information D. Author Contact Information
Paul Hoffman Paul Hoffman
Internet Mail Consortium and VPN Consortium Internet Mail Consortium and VPN Consortium
127 Segre Place 127 Segre Place
 End of changes. 36 change blocks. 
166 lines changed or deleted 64 lines changed or added
