draft-ietf-idn-punycode-00.txt | draft-ietf-idn-punycode-01.txt | |||
---|---|---|---|---|

INTERNET-DRAFT Adam M. Costello | INTERNET-DRAFT Adam M. Costello | |||

draft-ietf-idn-punycode-00.txt 2002-Jan-06 | draft-ietf-idn-punycode-01.txt 2002-Feb-24 | |||

Expires 2002-Jul-06 | Expires 2002-Aug-24 | |||

Punycode version 0.3.3 | Punycode: An encoding of Unicode for use with IDNA | |||

Status of this Memo | Status of this Memo | |||

This document is an Internet-Draft and is in full conformance with | This document is an Internet-Draft and is in full conformance with | |||

all provisions of Section 10 of RFC2026. | all provisions of Section 10 of RFC2026. | |||

Internet-Drafts are working documents of the Internet Engineering | Internet-Drafts are working documents of the Internet Engineering | |||

Task Force (IETF), its areas, and its working groups. Note | Task Force (IETF), its areas, and its working groups. Note | |||

that other groups may also distribute working documents as | that other groups may also distribute working documents as | |||

Internet-Drafts. | Internet-Drafts. | |||

skipping to change at line 28 | skipping to change at line 28 | |||

months and may be updated, replaced, or obsoleted by other documents | months and may be updated, replaced, or obsoleted by other documents | |||

at any time. It is inappropriate to use Internet-Drafts as | at any time. It is inappropriate to use Internet-Drafts as | |||

reference material or to cite them other than as "work in progress." | reference material or to cite them other than as "work in progress." | |||

The list of current Internet-Drafts can be accessed at | The list of current Internet-Drafts can be accessed at | |||

http://www.ietf.org/ietf/1id-abstracts.txt | http://www.ietf.org/ietf/1id-abstracts.txt | |||

The list of Internet-Draft Shadow Directories can be accessed at | The list of Internet-Draft Shadow Directories can be accessed at | |||

http://www.ietf.org/shadow.html | http://www.ietf.org/shadow.html | |||

Distribution of this document is unlimited. Please send comments | Distribution of this document is unlimited. | |||

to the author at amc@cs.berkeley.edu, or to the idn working | ||||

group at idn@ops.ietf.org. A non-paginated (and possibly | ||||

newer) version of this specification may be available at | ||||

http://www.cs.berkeley.edu/~amc/idn/ | ||||

Abstract | Abstract | |||

Punycode is a simple and efficient encoding designed for use with | Punycode is a simple and efficient transfer encoding syntax designed | |||

Internationalized Domain Names [IDN] [IDNA]. It uniquely and | for use with Internationalized Domain Names in Applications [IDNA]. | |||

reversibly transforms a Unicode string [UNICODE] into an ASCII | It uniquely and reversibly transforms a Unicode string [UNICODE] | |||

string. ASCII characters in the Unicode string are represented | into an ASCII string. ASCII characters in the Unicode string are | |||

literally, and non-ASCII characters are represented by ASCII | represented literally, and non-ASCII characters are represented by | |||

characters that are allowed in hostname labels (letters, digits, | ASCII characters that are allowed in host name labels (letters, | |||

and hyphens). Bootstring is a general algorithm that allows a | digits, and hyphens). Bootstring is a general algorithm that allows | |||

string of basic code points to uniquely represent any string of code | a string of basic code points to uniquely represent any string of | |||

points drawn from a larger set. Punycode is an instance Bootstring | code points drawn from a larger set. Punycode is an instance of | |||

that uses particular parameter values appropriate for IDNA. This | Bootstring that uses particular parameter values appropriate for | |||

document specifies Bootstring and the parameter values for Punycode. | IDNA. This document specifies Bootstring and the parameter values | |||

for Punycode. | ||||

Contents | Contents | |||

1. Introduction | 1. Introduction | |||

1.1 Features | ||||

1.2 Interaction of protocol parts | ||||

2. Terminology | 2. Terminology | |||

3. Bootstring description | 3. Bootstring description | |||

3.1 Basic code point segregation | 3.1 Basic code point segregation | |||

3.2 Insertion unsort coding | 3.2 Insertion unsort coding | |||

3.3 Generalized variable-length integers | 3.3 Generalized variable-length integers | |||

3.4 Bias adaptation | 3.4 Bias adaptation | |||

4. Bootstring parameters | 4. Bootstring parameters | |||

5. Parameter values for Punycode | 5. Parameter values for Punycode | |||

6. Bootstring algorithms | 6. Bootstring algorithms | |||

6.1 Bias adaptation function | 6.1 Bias adaptation function | |||

6.2 Decoding procedure | 6.2 Decoding procedure | |||

6.3 Encoding procedure | 6.3 Encoding procedure | |||

6.4 Alternative methods for handling overflow | 6.4 Alternative methods for handling overflow | |||

7. Punycode example strings | 7. Punycode examples | |||

7.1 Sample strings | ||||

7.2 Decoding traces | ||||

7.3 Encoding traces | ||||

8. Security considerations | 8. Security considerations | |||

9. References | 9. References | |||

A. Author contact information | A. Author contact information | |||

B. Mixed-case annotation | B. Mixed-case annotation | |||

C. Disclaimer and License | C. Disclaimer and license | |||

D. Punycode sample implementation | D. Punycode sample implementation | |||

1. Introduction | 1. Introduction | |||

The IDNA draft [IDNA] describes an architecture for supporting | [IDNA] describes an architecture for supporting internationalized | |||

internationalized domain names. Labels containing non-ASCII | domain names. Labels containing non-ASCII characters can be | |||

characters can be represented by ACE labels, which begin with a | represented by ACE labels, which begin with a special prefix and | |||

special prefix and contain only ASCII characters. The remainder | contain only ASCII characters. The remainder of the label after the | |||

of the label after the prefix is a Punycode encoding of a Unicode | prefix is a Punycode encoding of a Unicode string satisfying certain | |||

string satisfying certain constraints. For the details of the | constraints. For the details of the prefix and constraints, see | |||

prefix and constraints, see [IDNA] and [NAMEPREP]. | [IDNA] and [NAMEPREP]. | |||

1.1 Features | ||||

Bootstring has been designed to have the following features: | Bootstring has been designed to have the following features: | |||

* Completeness: Every extended string (sequence of arbitrary code | * Completeness: Every extended string (sequence of arbitrary code | |||

points) can be represented by a basic string (sequence of basic | points) can be represented by a basic string (sequence of basic | |||

code points). Restrictions on what strings are allowed, and on | code points). Restrictions on what strings are allowed, and on | |||

length, may be imposed by higher layers. | length, can be imposed by higher layers. | |||

* Uniqueness: There is at most one basic string that represents a | * Uniqueness: There is at most one basic string that represents a | |||

given extended string. | given extended string. | |||

* Reversibility: Any extended string mapped to a basic string can | * Reversibility: Any extended string mapped to a basic string can | |||

be recovered from that basic string. | be recovered from that basic string. | |||

* Efficient encoding: The ratio of extended string length to | * Efficient encoding: The ratio of basic string length to | |||

basic string length is small. This is important in the context | extended string length is small. This is important in the | |||

of domain names because RFC 1034 [RFC1034] restricts the length | context of domain names because RFC 1034 [RFC1034] restricts the | |||

of a domain label to 63 characters. | length of a domain label to 63 characters. | |||

* Simplicity: The encoding and decoding algorithms are reasonably | * Simplicity: The encoding and decoding algorithms are reasonably | |||

simple to implement. The goals of efficiency and simplicity are | simple to implement. The goals of efficiency and simplicity are | |||

at odds; Bootstring aims at a good balance between them. | at odds; Bootstring aims at a good balance between them. | |||

* Readability: Basic code points appearing in the extended | * Readability: Basic code points appearing in the extended string | |||

string are represented as themselves in the basic string. This | are represented as themselves in the basic string (although the | |||

comes for free because it makes the encoding more efficient on | main purpose is to improve efficiency, not readability). | |||

average. | ||||

In addition, Punycode can support an optional feature described in | Punycode can also support an additional feature described in | |||

appendix B "Mixed-case annotation". | appendix B "Mixed-case annotation". | |||

1.2 Interaction of protocol parts | ||||

Punycode is used by the IDNA protocol [IDNA] for converting domain | ||||

labels into ASCII; it is not designed for any other purpose. It is | ||||

explicitly not designed for processing arbitrary free text. | ||||

2. Terminology | 2. Terminology | |||

The key words "must", "shall", "required", "should", "recommended", | The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", | |||

and "may" in this document are to be interpreted as described in RFC | "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this | |||

2119 [RFC2119]. | document are to be interpreted as described in RFC 2119 [RFC2119]. | |||

A code point is an integral value associated with a character in a | A code point is an integral value associated with a character in a | |||

coded character set. | coded character set. | |||

As in the Unicode Standard [UNICODE], Unicode code points are | As in the Unicode Standard [UNICODE], Unicode code points are | |||

denoted by "U+" followed by four to six hexadecimal digits, while a | denoted by "U+" followed by four to six hexadecimal digits, while a | |||

range of code points is denoted by two hexadecimal numbers separated | range of code points is denoted by two hexadecimal numbers separated | |||

by "..", with no prefixes. | by "..", with no prefixes. | |||

The operators div and mod perform integer division; (x div y) is the | The operators div and mod perform integer division; (x div y) is the | |||

skipping to change at line 140 | skipping to change at line 149 | |||

and remainder are always nonnegative. | and remainder are always nonnegative. | |||

The break statement jumps out of the innermost loop (as in C). | The break statement jumps out of the innermost loop (as in C). | |||

An overflow is an attempt to compute a value that exceeds the | An overflow is an attempt to compute a value that exceeds the | |||

maximum value of an integer variable. | maximum value of an integer variable. | |||

3. Bootstring description | 3. Bootstring description | |||

Bootstring represents an arbitrary sequence of code points (the | Bootstring represents an arbitrary sequence of code points (the | |||

"extended string") as a sequence of basic code points (the | "extended string") as a sequence of basic code points (the "basic | |||

"basic string"). This section describes the representation. | string"). This section describes the representation. Section 6 | |||

Section 6 "Bootstring algorithms" presents the algorithms as | "Bootstring algorithms" presents the algorithms as pseudocode. | |||

pseudocode. | Sections 7.1 "Decoding traces" and 7.2 "Encoding traces" trace the | |||

algorithms for sample inputs. | ||||

The following sections describe the four techniques used in | ||||

Bootstring. "Basic code point segregation" is a very simple | ||||

and efficient encoding for basic code points occurring in the | ||||

extended string: they are simply copied all at once. "Insertion | ||||

unsort coding" encodes the non-basic code points as deltas, and | ||||

processes the code points in numerical order rather than in order of | ||||

appearance, which typically results in smaller deltas. The deltas | ||||

are represented as "generalized variable-length integers", which use | ||||

basic code points to represent nonnegative integers. The parameters | ||||

of this integer representation are dynamically adjusted using "bias | ||||

adaptation", to improve efficiency when consecutive deltas have | ||||

similar magnitudes. | ||||

3.1 Basic code point segregation | 3.1 Basic code point segregation | |||

All basic code points appearing in the extended string are | All basic code points appearing in the extended string are | |||

represented literally at the beginning of the basic string, in their | represented literally at the beginning of the basic string, in their | |||

original order, followed by a delimiter if (and only if) the number | original order, followed by a delimiter if (and only if) the number | |||

of basic code points is nonzero. The delimiter is a particular | of basic code points is nonzero. The delimiter is a particular | |||

basic code point, which never appears in the remainder of the basic | basic code point, which never appears in the remainder of the basic | |||

string. The decoder can therefore find the end of the literal | string. The decoder can therefore find the end of the literal | |||

portion (if there is one) by scanning for the last delimiter. | portion (if there is one) by scanning for the last delimiter. | |||

3.2 Insertion unsort coding | 3.2 Insertion unsort coding | |||

The remainder of the basic string (after the last delimiter if there | The remainder of the basic string (after the last delimiter if there | |||

is one) represents a sequence of nonnegative integral deltas as | is one) represents a sequence of nonnegative integral deltas as | |||

generalized variable-length integers, described in section 3.3. The | generalized variable-length integers, described in section 3.3. The | |||

meaning of the deltas is best understood in terms of the decoder. | meaning of the deltas is best understood in terms of the decoder. | |||

The decoder builds the extended string incrementally. Initially, | The decoder builds the extended string incrementally. Initially, | |||

the extended string is a copy of the literal portion of the basic | the extended string is a copy of the literal portion of the basic | |||

string (excluding the last delimiter). Each delta causes the | string (excluding the last delimiter). The decoder inserts | |||

decoder to insert a code point into the extended string according | non-basic code points, one for each delta, into the extended string, | |||

to the following procedure. There are two state variables: a | ultimately arriving at the final decoded string. | |||

code point n, and an index i that ranges from zero (which is the | ||||

first position of the extended string) to the current length of | At the heart of this process is a state machine with two state | |||

the extended string (which refers to a potential position beyond | variables: an index i and a counter n. The index i refers to | |||

the current end). The decoder advances the state monotonically | a position in the extended string; it ranges from 0 (the first | |||

(never returning to an earlier state) by taking steps only upward. | position) to the current length of the extended string (which refers | |||

Each step increments i, except when i already equals the length | to a potential position beyond the current end). If the current | |||

of the extended string, in which case a step resets i to zero | state is <n,i>, the next state is <n,i+1> if i is less than the | |||

and increments n. For each delta (in order), the decoder takes | length of the extended string, or <n+1,0> if i equals the length of | |||

delta steps upward, then inserts the value n into the extended | the extended string. In other words, each state change causes i to | |||

string at position i, then increments i (to skip over the code | increment, wrapping around to zero if necessary, and n counts the | |||

point just inserted). (An implementation should not take each | number of wrap-arounds. | |||

step individually, but should insead use division and remainder | ||||

calculations to advance by delta steps all at once.) It is an error | Notice that the state always advances monotonically (there is no | |||

if the inserted code point is a basic code point (because basic code | way for the decoder to return to an earlier state). At each state, | |||

points must be segregated as described in section 3.1). | an insertion is either performed or not performed. At most one | |||

insertion is performed in a given state. An insertion inserts the | ||||

value of n at position i in the extended string. The deltas are | ||||

a run-length encoding of this sequence of events: they are the | ||||

lengths of the runs of non-insertion states preceeding the insertion | ||||

states. Hence, for each delta, the decoder performs delta state | ||||

changes, then an insertion, and then one more state change. (An | ||||

implementation need not perform each state change individually, but | ||||

can instead use division and remainder calculations to compute the | ||||

next insertion state directly.) It is an error if the inserted code | ||||

point is a basic code point (because basic code points were supposed | ||||

to be segregated as described in section 3.1). | ||||

The encoder's main task is to derive the sequence of deltas that | The encoder's main task is to derive the sequence of deltas that | |||

will cause the decoder to construct the desired string. It can do | will cause the decoder to construct the desired string. It can do | |||

this by repeatedly scanning the extended string for the next code | this by repeatedly scanning the extended string for the next code | |||

point that the decoder would need to insert, and counting the number | point that the decoder would need to insert, and counting the number | |||

of steps the decoder would need to take, mindful of the fact that | of state changes the decoder would need to perform, mindful of the | |||

the decoder will be stepping over only those code points that have | fact that the decoder's extended string will include only those | |||

already been inserted. Section 6.3 "Encoding procedure" gives a | code points that have already been inserted. Section 6.3 "Encoding | |||

precise algorithm. | procedure" gives a precise algorithm. | |||

3.3 Generalized variable-length integers | 3.3 Generalized variable-length integers | |||

In a conventional integer representation the base is the number of | In a conventional integer representation the base is the number of | |||

distinct symbols for digits, whose values are 0 through base-1. Let | distinct symbols for digits, whose values are 0 through base-1. Let | |||

digit_0 denote the least significant digit, digit_1 the next least | digit_0 denote the least significant digit, digit_1 the next least | |||

significant, and so on. The value represented is the sum over j of | significant, and so on. The value represented is the sum over j of | |||

digit_j * w(j), where w(j) = base^j is the weight (scale factor) | digit_j * w(j), where w(j) = base^j is the weight (scale factor) | |||

for position j. For example, in the base 8 integer 437, the digits | for position j. For example, in the base 8 integer 437, the digits | |||

are 7, 3, and 4, and the weights are 1, 8, and 64, so the value is | are 7, 3, and 4, and the weights are 1, 8, and 64, so the value is | |||

7 + 3*8 + 4*64 = 287. This representation has two disadvantages: | 7 + 3*8 + 4*64 = 287. This representation has two disadvantages: | |||

First, there are multiple encodings of each value (because there | First, there are multiple encodings of each value (because there | |||

can be extra zeros in the most significant positions), which | can be extra zeros in the most significant positions), which is | |||

is inconvenient when unique encodings are required. Second, | inconvenient when unique encodings are needed. Second, the integer | |||

the integer is not self-delimiting, so if multiple integers are | is not self-delimiting, so if multiple integers are concatenated the | |||

concatenated the boundaries between them are lost. | boundaries between them are lost. | |||

The generalized variable-length representation solves these two | The generalized variable-length representation solves these two | |||

problems. The digit values are still 0 through base-1, but now | problems. The digit values are still 0 through base-1, but now | |||

the integer is self-delimiting by means of thresholds t(j), each | the integer is self-delimiting by means of thresholds t(j), each | |||

of which is in the range 0 through base-1. Exactly one digit, the | of which is in the range 0 through base-1. Exactly one digit, the | |||

most significant, satisfies digit_j < t(j). Therefore, if several | most significant, satisfies digit_j < t(j). Therefore, if several | |||

integers are concatenated, it is easy to separate them, starting | integers are concatenated, it is easy to separate them, starting | |||

with the first if they are little-endian (least significant digit | with the first if they are little-endian (least significant digit | |||

first), or starting with the last if they are big-endian (most | first), or starting with the last if they are big-endian (most | |||

significant digit first). As before, the value is the sum over j of | significant digit first). As before, the value is the sum over j of | |||

skipping to change at line 221 | skipping to change at line 255 | |||

of which is in the range 0 through base-1. Exactly one digit, the | of which is in the range 0 through base-1. Exactly one digit, the | |||

most significant, satisfies digit_j < t(j). Therefore, if several | most significant, satisfies digit_j < t(j). Therefore, if several | |||

integers are concatenated, it is easy to separate them, starting | integers are concatenated, it is easy to separate them, starting | |||

with the first if they are little-endian (least significant digit | with the first if they are little-endian (least significant digit | |||

first), or starting with the last if they are big-endian (most | first), or starting with the last if they are big-endian (most | |||

significant digit first). As before, the value is the sum over j of | significant digit first). As before, the value is the sum over j of | |||

digit_j * w(j), but the weights are different: | digit_j * w(j), but the weights are different: | |||

w(0) = 1 | w(0) = 1 | |||

w(j) = w(j-1) * (base - t(j-1)) for j > 0 | w(j) = w(j-1) * (base - t(j-1)) for j > 0 | |||

For example, consider the little-endian sequence of base 8 digits | For example, consider the little-endian sequence of base 8 digits | |||

734251... Suppose the thresholds are 2, 3, 5, 5, 5, 5... This | 734251... Suppose the thresholds are 2, 3, 5, 5, 5, 5... This | |||

implies that the weights are 1, 1*(8-2) = 6, 6*(8-3) = 30, 30*(8-5) | implies that the weights are 1, 1*(8-2) = 6, 6*(8-3) = 30, 30*(8-5) | |||

= 90, 90*(8-5) = 270, and so on. 7 is not less than 2, and 3 is | = 90, 90*(8-5) = 270, and so on. 7 is not less than 2, and 3 is | |||

not less than 3, but 4 is less than 5, so 4 must be the last digit. | not less than 3, but 4 is less than 5, so 4 is the last digit. The | |||

The value of 734 is 7*1 + 3*6 + 4*30 = 145. The next integer is | value of 734 is 7*1 + 3*6 + 4*30 = 145. The next integer is 251, | |||

251, with value 2*1 + 5*6 + 1*30 = 62. Decoding this representation | with value 2*1 + 5*6 + 1*30 = 62. Decoding this representation | |||

is very similar to decoding a conventional integer: Start with a | is very similar to decoding a conventional integer: Start with a | |||

current value of N = 0 and a weight w = 1. Fetch the next digit d | current value of N = 0 and a weight w = 1. Fetch the next digit d | |||

and increase N by d * w. If d is less than the current threshold | and increase N by d * w. If d is less than the current threshold | |||

(t) then stop, otherwise increase w by a factor of (base - t), | (t) then stop, otherwise increase w by a factor of (base - t), | |||

update t for the next position, and repeat. | update t for the next position, and repeat. | |||

Encoding this representation is similar to encoding a conventional | Encoding this representation is similar to encoding a conventional | |||

integer: If N < t then output one digit for N and stop, otherwise | integer: If N < t then output one digit for N and stop, otherwise | |||

output the digit for t + ((N - t) mod (base - t)), then replace N | output the digit for t + ((N - t) mod (base - t)), then replace N | |||

with (N - t) div (base - t), update t for the next position, and | with (N - t) div (base - t), update t for the next position, and | |||

skipping to change at line 306 | skipping to change at line 339 | |||

a hint about the likely size of the next delta, and so t(j) is | a hint about the likely size of the next delta, and so t(j) is | |||

set to tmax for the more significant digits starting with the one | set to tmax for the more significant digits starting with the one | |||

expected to be last, tmin for the less significant digits up through | expected to be last, tmin for the less significant digits up through | |||

the one expected to be third-last, and somewhere between tmin and | the one expected to be third-last, and somewhere between tmin and | |||

tmax for the digit expected to be second-last (balancing the hope of | tmax for the digit expected to be second-last (balancing the hope of | |||

the expected-last digit being unnecessary against the danger of it | the expected-last digit being unnecessary against the danger of it | |||

being insufficient). | being insufficient). | |||

4. Bootstring parameters | 4. Bootstring parameters | |||

Given a set of basic code points, one must be designated as | Given a set of basic code points, one needs to be designated as | |||

the delimiter. The base can be no greater than the number of | the delimiter. The base cannot be greater than the number of | |||

distinguishable basic code points remaining. The digit-values | distinguishable basic code points remaining. The digit-values in | |||

in the range 0 through base-1 must be associated with distinct | the range 0 through base-1 need to be associated with distinct | |||

non-delimiter basic code points. In some cases multiple code points | non-delimiter basic code points. In some cases multiple code points | |||

must have the same digit-value; for example, uppercase and lowercase | need to have the same digit-value; for example, uppercase and | |||

versions of the same letter must be equivalent if basic strings are | lowercase versions of the same letter need to be equivalent if basic | |||

case-insensitive. | strings are case-insensitive. | |||

The initial value of n must be no greater than the minimum non-basic | The initial value of n cannot be greater than the minimum non-basic | |||

code point that could appear in extended strings. | code point that could appear in extended strings. | |||

The remaining five parameters (tmin, tmax, skew, damp, and the | The remaining five parameters (tmin, tmax, skew, damp, and the | |||

initial value of bias) must satisfy the following constraints: | initial value of bias) need to satisfy the following constraints: | |||

0 <= tmin <= tmax <= base-1 | 0 <= tmin <= tmax <= base-1 | |||

skew >= 1 | skew >= 1 | |||

damp >= 2 | damp >= 2 | |||

initial_bias mod base <= base - tmin | initial_bias mod base <= base - tmin | |||

Provided the constraints are satisfied, these five parameters affect | Provided the constraints are satisfied, these five parameters affect | |||

efficiency but not correctness. They should be chosen empirically. | efficiency but not correctness. They are best chosen empirically. | |||

If support for mixed-case annotation is desired (see appendix B), | If support for mixed-case annotation is desired (see appendix B), | |||

make sure that the code points corresponding to 0 through tmax-1 all | make sure that the code points corresponding to 0 through tmax-1 all | |||

have both uppercase and lowercase forms. | have both uppercase and lowercase forms. | |||

5. Parameter values for Punycode | 5. Parameter values for Punycode | |||

Punycode uses the following Bootstring parameter values: | Punycode uses the following Bootstring parameter values: | |||

base = 36 | base = 36 | |||

tmin = 1 | tmin = 1 | |||

tmax = 26 | tmax = 26 | |||

skew = 38 | skew = 38 | |||

damp = 700 | damp = 700 | |||

initial_bias = 72 | initial_bias = 72 | |||

initial_n = 0x80 | initial_n = 128 = 0x80 | |||

In Punycode, code points are Unicode code points [UNICODE], that | Although the only restriction Punycode imposes on the input integers | |||

is, integers in the range 0..10FFFF, but not D800..DFFF, which are | is that they be nonnegative, these parameters are especially | |||

reserved for use by UTF-16. The basic code points are the ASCII | designed to work well with Unicode [UNICODE] code points, which | |||

code points (0..7F), of which U+002D (-) is the delimiter, and some | are integers in the range 0..10FFFF (but not D800..DFFF, which are | |||

of the others have digit-values as follows: | reserved for use by the UTF-16 encoding of Unicode). The basic code | |||

points are the ASCII code points (0..7F), of which U+002D (-) is the | ||||

delimiter, and some of the others have digit-values as follows: | ||||

code points digit-values | code points digit-values | |||

------------ ---------------------- | ------------ ---------------------- | |||

41..5A (A-Z) = 0 to 25, respectively | 41..5A (A-Z) = 0 to 25, respectively | |||

61..7A (a-z) = 0 to 25, respectively | 61..7A (a-z) = 0 to 25, respectively | |||

30..39 (0-9) = 26 to 35, respectively | 30..39 (0-9) = 26 to 35, respectively | |||

Using hyphen-minus as the delimiter implies that the encoded string | Using hyphen-minus as the delimiter implies that the encoded string | |||

can end with a hyphen-minus only if the Unicode string consists | can end with a hyphen-minus only if the Unicode string consists | |||

entirely of basic code points, but IDNA forbids such strings from | entirely of basic code points, but IDNA forbids such strings from | |||

being encoded. The encoded string can begin with a hyphen-minus, | being encoded. The encoded string can begin with a hyphen-minus, | |||

but IDNA prepends a prefix. Therefore IDNA using Punycode conforms | but IDNA prepends a prefix. Therefore IDNA using Punycode conforms | |||

to the RFC 952 recommendation that hostname labels neither begin nor | to the RFC 952 rule that host name labels neither begin nor end with | |||

end with a hyphen-minus [RFC952]. | a hyphen-minus [RFC952]. | |||

A decoder must recognize the letters in both uppercase and lowercase | A decoder MUST recognize the letters in both uppercase and lowercase | |||

forms (including mixtures of both forms). An encoder should output | forms (including mixtures of both forms). An encoder SHOULD output | |||

only uppercase forms or only lowercase forms, unless it uses | only uppercase forms or only lowercase forms, unless it uses | |||

mixed-case annotation (see appendix B). | mixed-case annotation (see appendix B). | |||

Presumably most users will not manually write or type encoded | Presumably most users will not manually write or type encoded | |||

strings (as opposed to cutting and pasting them), but those who do | strings (as opposed to cutting and pasting them), but those who do | |||

will need to be alert to the potential visual ambiguity between the | will need to be alert to the potential visual ambiguity between the | |||

following sets of characters: | following sets of characters: | |||

G 6 | G 6 | |||

I l 1 | I l 1 | |||

skipping to change at line 390 | skipping to change at line 425 | |||

Z 2 | Z 2 | |||

Such ambiguities are usually resolved by context, but in a Punycode | Such ambiguities are usually resolved by context, but in a Punycode | |||

encoded string there is no context apparent to humans. | encoded string there is no context apparent to humans. | |||

6. Bootstring algorithms | 6. Bootstring algorithms | |||

Some parts of the pseudocode can be omitted if the parameters | Some parts of the pseudocode can be omitted if the parameters | |||

satisfy certain conditions (for which Punycode qualifies). These | satisfy certain conditions (for which Punycode qualifies). These | |||

parts are enclosed in {braces}, and notes immediately following the | parts are enclosed in {braces}, and notes immediately following the | |||

pseudocode explain the conditions under which they may be omitted. | pseudocode explain the conditions under which they can be omitted. | |||

Formally, code points are integers, and hence the pseudocode assumes | Formally, code points are integers, and hence the pseudocode assumes | |||

that arithmetic operations can be performed directly on code points. | that arithmetic operations can be performed directly on code points. | |||

Some actual programming languages might require explicit conversion | In some programming languages, explicit conversion between code | |||

between code points and integers. | points and integers might be necessary. | |||

6.1 Bias adaptation function | 6.1 Bias adaptation function | |||

function adapt(delta,numpoints,firsttime): | function adapt(delta,numpoints,firsttime): | |||

if firsttime then let delta = delta div damp | if firsttime then let delta = delta div damp | |||

else let delta = delta div 2 | else let delta = delta div 2 | |||

let delta = delta + (delta div numpoints) | let delta = delta + (delta div numpoints) | |||

let k = 0 | let k = 0 | |||

while delta > ((base - tmin) * tmax) div 2 do begin | while delta > ((base - tmin) * tmax) div 2 do begin | |||

let delta = delta div (base - tmin) | let delta = delta div (base - tmin) | |||

skipping to change at line 432 | skipping to change at line 467 | |||

and copy them to output, fail on any non-basic code point | and copy them to output, fail on any non-basic code point | |||

if more than zero code points were consumed then consume one more | if more than zero code points were consumed then consume one more | |||

(which will be the last delimiter) | (which will be the last delimiter) | |||

while the input is not exhausted do begin | while the input is not exhausted do begin | |||

let oldi = i | let oldi = i | |||

let w = 1 | let w = 1 | |||

for k = base to infinity in steps of base do begin | for k = base to infinity in steps of base do begin | |||

consume a code point, or fail if there was none to consume | consume a code point, or fail if there was none to consume | |||

let digit = the code point's digit-value, fail if it has none | let digit = the code point's digit-value, fail if it has none | |||

let i = i + digit * w, fail on overflow | let i = i + digit * w, fail on overflow | |||

let t = tmin if k <= bias, tmax if k >= bias + tmax, or | let t = tmin if k <= bias {+ tmin}, or | |||

k - bias otherwise | tmax if k >= bias + tmax, or k - bias otherwise | |||

if digit < t then break | if digit < t then break | |||

let w = w * (base - t), fail on overflow | let w = w * (base - t), fail on overflow | |||

end | end | |||

let bias = adapt(i - oldi, length(output) + 1, test oldi is 0?) | let bias = adapt(i - oldi, length(output) + 1, test oldi is 0?) | |||

let n = n + i div (length(output) + 1), fail on overflow | let n = n + i div (length(output) + 1), fail on overflow | |||

let i = i mod (length(output) + 1) | let i = i mod (length(output) + 1) | |||

{if n is a basic code point then fail} | {if n is a basic code point then fail} | |||

insert n into output at position i | insert n into output at position i | |||

increment i | increment i | |||

end | end | |||

The statement enclosed in braces (checking whether n is a basic | The full statement enclosed in braces (checking whether n is a basic | |||

code point) may be omitted if initial_n exceeds all basic code | code point) can be omitted if initial_n exceeds all basic code | |||

points (which is true for Punycode), because n is never less than | points (which is true for Punycode), because n is never less than | |||

initial_n. | initial_n. | |||

In the assignment of t, where t is clamped to the range tmin through | ||||

tmax, "+ tmin" can always be omitted. This makes the clamping | ||||

calculation incorrect when bias < k < bias + tmin, but that cannot | ||||

happen because of the way bias is computed and because of the | ||||

constraints on the parameters. | ||||

Because the decoder state can only advance monotonically, and there | Because the decoder state can only advance monotonically, and there | |||

is only one representation of any delta, there is therefore only | is only one representation of any delta, there is therefore only | |||

one encoded string that can represent a given sequence of integers. | one encoded string that can represent a given sequence of integers. | |||

The only error conditions are invalid code points, unexpected | The only error conditions are invalid code points, unexpected | |||

end-of-input, overflow, and basic code points encoded using deltas | end-of-input, overflow, and basic code points encoded using deltas | |||

instead of appearing literally. If the decoder fails on these | instead of appearing literally. If the decoder fails on these | |||

errors as shown above, then it cannot produce the same output for | errors as shown above, then it cannot produce the same output for | |||

two distinct inputs, and hence it need not re-encode its output to | two distinct inputs, and hence it need not re-encode its output to | |||

verify that it matches the input. | verify that it matches the input. | |||

The assignment of t, where t is clamped to the range tmin through | ||||

tmax, does not handle the case where bias < k < bias + tmin, but | ||||

that is impossible because of the way bias is computed and because | ||||

of the constraints on the parameters. | ||||

If the programming language does not provide overflow detection, | If the programming language does not provide overflow detection, | |||

the following technique can be used. Suppose A, B, and C are | the following technique can be used. Suppose A, B, and C are | |||

representable nonnegative integers and C is nonzero. Then A + B | representable nonnegative integers and C is nonzero. Then A + B | |||

overflows if and only if B > maxint - A, and A + (B * C) overflows | overflows if and only if B > maxint - A, and A + (B * C) overflows | |||

if and only if B > (maxint - A) div C, where maxint is the greatest | if and only if B > (maxint - A) div C, where maxint is the greatest | |||

integer for which maxint + 1 cannot be represented. Refer to | integer for which maxint + 1 cannot be represented. Refer to | |||

appendix D "Punycode sample implementation" for demonstrations of | appendix D "Punycode sample implementation" for demonstrations of | |||

this technique in the C language. See also section 6.4 "Alternative | this technique in the C language. See also section 6.4 "Alternative | |||

methods for handling overflow". | methods for handling overflow". | |||

skipping to change at line 492 | skipping to change at line 528 | |||

{if the input contains a non-basic code point < n then fail} | {if the input contains a non-basic code point < n then fail} | |||

while h < length(input) do begin | while h < length(input) do begin | |||

let m = the minimum {non-basic} code point >= n in the input | let m = the minimum {non-basic} code point >= n in the input | |||

let delta = delta + (m - n) * (h + 1), fail on overflow | let delta = delta + (m - n) * (h + 1), fail on overflow | |||

let n = m | let n = m | |||

for each code point c in the input (in order) do begin | for each code point c in the input (in order) do begin | |||

if c < n {or c is basic} then increment delta, fail on overflow | if c < n {or c is basic} then increment delta, fail on overflow | |||

if c == n then begin | if c == n then begin | |||

let q = delta | let q = delta | |||

for k = base to infinity in steps of base do begin | for k = base to infinity in steps of base do begin | |||

let t = tmin if k <= bias, tmax if k >= bias + tmax, or | let t = tmin if k <= bias {+ tmin}, or | |||

k - bias otherwise | tmax if k >= bias + tmax, or k - bias otherwise | |||

if q < t then break | if q < t then break | |||

output the code point for digit t + ((q - t) mod (base - t)) | output the code point for digit t + ((q - t) mod (base - t)) | |||

let q = (q - t) div (base - t) | let q = (q - t) div (base - t) | |||

end | end | |||

output the code point for digit q | output the code point for digit q | |||

let bias = adapt(delta, h + 1, test h equals b?) | let bias = adapt(delta, h + 1, test h equals b?) | |||

let delta = 0 | let delta = 0 | |||

increment h | increment h | |||

end | end | |||

end | end | |||

skipping to change at line 516 | skipping to change at line 553 | |||

The full statement enclosed in braces (checking whether the input | The full statement enclosed in braces (checking whether the input | |||

contains a non-basic code point less than n) can be omitted if all | contains a non-basic code point less than n) can be omitted if all | |||

code points less than initial_n are basic code points (which is true | code points less than initial_n are basic code points (which is true | |||

for Punycode if code points are unsigned). | for Punycode if code points are unsigned). | |||

The brace-enclosed conditions "non-basic" and "or m is basic" can be | The brace-enclosed conditions "non-basic" and "or m is basic" can be | |||

omitted if initial_n exceeds all basic code points (which is true | omitted if initial_n exceeds all basic code points (which is true | |||

for Punycode), because the code point being tested is never less | for Punycode), because the code point being tested is never less | |||

than initial_n. | than initial_n. | |||

In the assignment of t, where t is clamped to the range tmin through | ||||

tmax, "+ tmin" can always be omitted. This makes the clamping | ||||

calculation incorrect when bias < k < bias + tmin, but that cannot | ||||

happen because of the way bias is computed and because of the | ||||

constraints on the parameters. | ||||

The checks for overflow are necessary to avoid producing invalid | The checks for overflow are necessary to avoid producing invalid | |||

output when the input contains very large values or is very long. | output when the input contains very large values or is very long. | |||

Wider integer variables can handle more extreme inputs. For IDNA, | Wider integer variables can handle more extreme inputs. For IDNA, | |||

26-bit unsigned integers are sufficient, because any string that | 26-bit unsigned integers are sufficient, because any string that | |||

required a 27-bit delta would have to exceed either the code point | needed a 27-bit delta would have to exceed either the code point | |||

limit (0..10FFFF) or the label length limit (63 characters). | limit (0..10FFFF) or the label length limit (63 characters). | |||

The increment of delta at the bottom of the outer loop cannot | The increment of delta at the bottom of the outer loop cannot | |||

overflow because delta < length(input) before the increment, and | overflow because delta < length(input) before the increment, and | |||

length(input) is already assumed to be representable. The increment | length(input) is already assumed to be representable. The increment | |||

of n could overflow, but only if h == length(input), in which case | of n could overflow, but only if h == length(input), in which case | |||

the procedure is finished anyway. | the procedure is finished anyway. | |||

6.4 Alternative methods for handling overflow | 6.4 Alternative methods for handling overflow | |||

skipping to change at line 547 | skipping to change at line 590 | |||

if integer variables were capable of representing values that large. | if integer variables were capable of representing values that large. | |||

This prevention approach would impose more restrictions on the input | This prevention approach would impose more restrictions on the input | |||

than the detection approach does, but might be considered simpler in | than the detection approach does, but might be considered simpler in | |||

some programming languages. | some programming languages. | |||

In theory, the decoder could use an analogous approach, limiting the | In theory, the decoder could use an analogous approach, limiting the | |||

number of digits in a variable-length integer (that is, limiting the | number of digits in a variable-length integer (that is, limiting the | |||

number of iterations in the innermost loop). However, the number | number of iterations in the innermost loop). However, the number | |||

of digits that suffice to represent a given delta can sometimes | of digits that suffice to represent a given delta can sometimes | |||

represent much larger deltas (because of the adaptation), and hence | represent much larger deltas (because of the adaptation), and hence | |||

this approach would probably require integers wider than 32 bits. | this approach would probably need integers wider than 32 bits. | |||

Yet another approach for the decoder is to allow overflow to occur, | Yet another approach for the decoder is to allow overflow to occur, | |||

but to check the final output string by re-encoding it and comparing | but to check the final output string by re-encoding it and comparing | |||

to the decoder input. If and only if they do not match (using a | to the decoder input. If and only if they do not match (using a | |||

case-insensitive ASCII comparison) overflow has occurred. This | case-insensitive ASCII comparison) overflow has occurred. This | |||

delayed-detection approach would not impose any more restrictions on | delayed-detection approach would not impose any more restrictions on | |||

the input than the immediate-detection approach does, and might be | the input than the immediate-detection approach does, and might be | |||

considered simpler in some programming languages. | considered simpler in some programming languages. | |||

In fact, if the decoder is used only inside the IDNA ToUnicode | In fact, if the decoder is used only inside the IDNA ToUnicode | |||

operation [IDNA], then it need not check for overflow at all, | operation [IDNA], then it need not check for overflow at all, | |||

because ToUnicode performs a higher level re-encoding and | because ToUnicode performs a higher level re-encoding and | |||

comparison, and a mismatch has the same consequence as if the | comparison, and a mismatch has the same consequence as if the | |||

Punycode decoder had failed. | Punycode decoder had failed. | |||

7. Punycode example strings | 7. Punycode examples | |||

7.1 Sample strings | ||||

In the Punycode encodings below, the IDNA signature prefix is not | In the Punycode encodings below, the IDNA signature prefix is not | |||

shown. Backslashes show where line breaks have been inserted in | shown. Backslashes show where line breaks have been inserted in | |||

strings too long for one line. | strings too long for one line. | |||

The first several examples are all translations of the sentence "Why | The first several examples are all translations of the sentence "Why | |||

can't they just speak in <language>?" (courtesy of Michael Kaplan's | can't they just speak in <language>?" (courtesy of Michael Kaplan's | |||

"provincial" page [PROVINCIAL]). Word breaks and punctuation have | "provincial" page [PROVINCIAL]). Word breaks and punctuation have | |||

been removed, as is often done in domain names. | been removed, as is often done in domain names. | |||

skipping to change at line 680 | skipping to change at line 725 | |||

Punycode: MajiKoi5-783gue6qz075azm5e | Punycode: MajiKoi5-783gue6qz075azm5e | |||

(Q) <pafii>de<runba> | (Q) <pafii>de<runba> | |||

u+30D1 u+30D5 u+30A3 u+30FC u+0064 u+0065 u+30EB u+30F3 u+30D0 | u+30D1 u+30D5 u+30A3 u+30FC u+0064 u+0065 u+30EB u+30F3 u+30D0 | |||

Punycode: de-jg4avhby1noc0d | Punycode: de-jg4avhby1noc0d | |||

(R) <sono><supiido><de> | (R) <sono><supiido><de> | |||

u+305D u+306E u+30B9 u+30D4 u+30FC u+30C9 u+3067 | u+305D u+306E u+30B9 u+30D4 u+30FC u+30C9 u+3067 | |||

Punycode: d9juau41awczczp | Punycode: d9juau41awczczp | |||

The last example is an ASCII string that breaks not only the | The last example is an ASCII string that breaks not only the | |||

existing rules for host name labels but also the rules proposed in | existing rules for host name labels but also the rules in [NAMEPREP] | |||

[NAMEPREP] for internationalized domain names. | for internationalized domain names. | |||

(S) -> $1.00 <- | (S) -> $1.00 <- | |||

u+002D u+003E u+0020 u+0024 u+0031 u+002E u+0030 u+0030 u+0020 | u+002D u+003E u+0020 u+0024 u+0031 u+002E u+0030 u+0030 u+0020 | |||

u+003C u+002D | u+003C u+002D | |||

Punycode: -> $1.00 <-- | Punycode: -> $1.00 <-- | |||

7.2 Decoding traces | ||||

In the following traces, the evolving state of the decoder is | ||||

shown as a sequence of hexadecimal values, representing the code | ||||

points in the extended string. An asterisk appears just after the | ||||

most recently inserted code point, indicating both n (the value | ||||

preceeding the asterisk) and i (the position of the value just after | ||||

the asterisk). Other numerical values are decimal. | ||||

Decoding trace of example B from section 7.1: | ||||

n is 128, i is 0, bias is 72 | ||||

input is "ihqwcrb4cv8a8dqg056pqjye" | ||||

there is no delimiter, so extended string starts empty | ||||

delta "ihq" decodes to 19853 | ||||

bias becomes 21 | ||||

4E0D * | ||||

delta "wc" decodes to 64 | ||||

bias becomes 20 | ||||

4E0D 4E2D * | ||||

delta "rb" decodes to 37 | ||||

bias becomes 13 | ||||

4E3A * 4E0D 4E2D | ||||

delta "4c" decodes to 56 | ||||

bias becomes 17 | ||||

4E3A 4E48 * 4E0D 4E2D | ||||

delta "v8a" decodes to 599 | ||||

bias becomes 32 | ||||

4E3A 4EC0 * 4E48 4E0D 4E2D | ||||

delta "8d" decodes to 130 | ||||

bias becomes 23 | ||||

4ED6 * 4E3A 4EC0 4E48 4E0D 4E2D | ||||

delta "qg" decodes to 154 | ||||

bias becomes 25 | ||||

4ED6 4EEC * 4E3A 4EC0 4E48 4E0D 4E2D | ||||

delta "056p" decodes to 46301 | ||||

bias becomes 84 | ||||

4ED6 4EEC 4E3A 4EC0 4E48 4E0D 4E2D 6587 * | ||||

delta "qjye" decodes to 88531 | ||||

bias becomes 90 | ||||

4ED6 4EEC 4E3A 4EC0 4E48 4E0D 8BF4 * 4E2D 6587 | ||||

Decoding trace of example L from section 7.1: | ||||

n is 128, i is 0, bias is 72 | ||||

input is "3B-ww4c5e180e575a65lsy2b" | ||||

literal portion is "3B-", so extended string starts as: | ||||

0033 0042 | ||||

delta "ww4c" decodes to 62042 | ||||

bias becomes 27 | ||||

0033 0042 5148 * | ||||

delta "5e" decodes to 139 | ||||

bias becomes 24 | ||||

0033 0042 516B * 5148 | ||||

delta "180e" decodes to 16683 | ||||

bias becomes 67 | ||||

0033 5E74 * 0042 516B 5148 | ||||

delta "575a" decodes to 34821 | ||||

bias becomes 82 | ||||

0033 5E74 0042 516B 5148 751F * | ||||

delta "65l" decodes to 14592 | ||||

bias becomes 67 | ||||

0033 5E74 0042 7D44 * 516B 5148 751F | ||||

delta "sy2b" decodes to 42088 | ||||

bias becomes 84 | ||||

0033 5E74 0042 7D44 91D1 * 516B 5148 751F | ||||

7.3 Encoding traces | ||||

In the following traces, code point values are hexadecimal, while | ||||

other numerical values are decimal. | ||||

Encoding trace of example B from section 7.1: | ||||

bias is 72 | ||||

input is: | ||||

4ED6 4EEC 4E3A 4EC0 4E48 4E0D 8BF4 4E2D 6587 | ||||

there are no basic code points, so no literal portion | ||||

next code point to insert is 4E0D | ||||

needed delta is 19853, encodes as "ihq" | ||||

bias becomes 21 | ||||

next code point to insert is 4E2D | ||||

needed delta is 64, encodes as "wc" | ||||

bias becomes 20 | ||||

next code point to insert is 4E3A | ||||

needed delta is 37, encodes as "rb" | ||||

bias becomes 13 | ||||

next code point to insert is 4E48 | ||||

needed delta is 56, encodes as "4c" | ||||

bias becomes 17 | ||||

next code point to insert is 4EC0 | ||||

needed delta is 599, encodes as "v8a" | ||||

bias becomes 32 | ||||

next code point to insert is 4ED6 | ||||

needed delta is 130, encodes as "8d" | ||||

bias becomes 23 | ||||

next code point to insert is 4EEC | ||||

needed delta is 154, encodes as "qg" | ||||

bias becomes 25 | ||||

next code point to insert is 6587 | ||||

needed delta is 46301, encodes as "056p" | ||||

bias becomes 84 | ||||

next code point to insert is 8BF4 | ||||

needed delta is 88531, encodes as "qjye" | ||||

bias becomes 90 | ||||

output is "ihqwcrb4cv8a8dqg056pqjye" | ||||

Encoding trace of example L from section 7.1: | ||||

bias is 72 | ||||

input is: | ||||

0033 5E74 0042 7D44 91D1 516B 5148 751F | ||||

basic code points (0033, 0042) are copied to literal portion: "3B-" | ||||

next code point to insert is 5148 | ||||

needed delta is 62042, encodes as "ww4c" | ||||

bias becomes 27 | ||||

next code point to insert is 516B | ||||

needed delta is 139, encodes as "5e" | ||||

bias becomes 24 | ||||

next code point to insert is 5E74 | ||||

needed delta is 16683, encodes as "180e" | ||||

bias becomes 67 | ||||

next code point to insert is 751F | ||||

needed delta is 34821, encodes as "575a" | ||||

bias becomes 82 | ||||

next code point to insert is 7D44 | ||||

needed delta is 14592, encodes as "65l" | ||||

bias becomes 67 | ||||

next code point to insert is 91D1 | ||||

needed delta is 42088, encodes as "sy2b" | ||||

bias becomes 84 | ||||

output is "3B-ww4c5e180e575a65lsy2b" | ||||

8. Security considerations | 8. Security considerations | |||

Users expect each domain name in DNS to be controlled by a single | Users expect each domain name in DNS to be controlled by a single | |||

authority. If a Unicode string intended for use as a domain label | authority. If a Unicode string intended for use as a domain label | |||

could map to multiple ACE labels, then an internationalized domain | could map to multiple ACE labels, then an internationalized domain | |||

name could map to multiple ASCII domain names, each controlled by | name could map to multiple ASCII domain names, each controlled by | |||

a different authority, some of which could be spoofs that hijack | a different authority, some of which could be spoofs that hijack | |||

service requests intended for another. Therefore Punycode is | service requests intended for another. Therefore Punycode is | |||

designed so that each Unicode string has a unique encoding. | designed so that each Unicode string has a unique encoding. | |||

However, there can still be multiple Unicode representations of the | However, there can still be multiple Unicode representations of the | |||

"same" text, for various definitions of "same". This problem is | "same" text, for various definitions of "same". This problem is | |||

addressed to some extent by the Unicode standard under the topic of | addressed to some extent by the Unicode standard under the topic of | |||

canonicalization, and this work is leveraged for domain names by | canonicalization, and this work is leveraged for domain names by | |||

"nameprep" [NAMEPREP]. | Nameprep [NAMEPREP]. | |||

9. References | 9. References | |||

[IDN] Internationalized Domain Names (IETF working group), | ||||

http://www.i-d-n.net/, idn@ops.ietf.org. | ||||

[IDNA] Patrik Faltstrom, Paul Hoffman, Adam M. Costello, | [IDNA] Patrik Faltstrom, Paul Hoffman, Adam M. Costello, | |||

"Internationalizing Host Names In Applications (IDNA)", 2001-Nov-19, | "Internationalizing Domain Names In Applications (IDNA)", | |||

draft-ietf-idn-idna-05. | 2002-###-##, draft-ietf-idn-idna-07. | |||

[NAMEPREP] Paul Hoffman, Marc Blanchet, "Stringprep | [NAMEPREP] Paul Hoffman, Marc Blanchet, "Nameprep: A Stringprep | |||

Profile for Internationalized Host Names", 2001-Sep-27, | Profile for Internationalized Domain Names", 2002-###-##, | |||

draft-ietf-idn-nameprep-06. | draft-ietf-idn-nameprep-08. | |||

[PROVINCIAL] Michael Kaplan, "The 'anyone can be provincial!' page", | [PROVINCIAL] Michael Kaplan, "The 'anyone can be provincial!' page", | |||

http://www.trigeminal.com/samples/provincial.html. | http://www.trigeminal.com/samples/provincial.html. | |||

[RFC952] K. Harrenstien, M. Stahl, E. Feinler, "DOD Internet Host | [RFC952] K. Harrenstien, M. Stahl, E. Feinler, "DOD Internet Host | |||

Table Specification", 1985-Oct, RFC 952. | Table Specification", 1985-Oct, RFC 952. | |||

[RFC1034] P. Mockapetris, "Domain Names - Concepts and Facilities", | [RFC1034] P. Mockapetris, "Domain Names - Concepts and Facilities", | |||

1987-Nov, RFC 1034. | 1987-Nov, RFC 1034. | |||

skipping to change at line 739 | skipping to change at line 912 | |||

A. Author contact information | A. Author contact information | |||

Adam M. Costello | Adam M. Costello | |||

University of California, Berkeley | University of California, Berkeley | |||

http://www.nicemice.net/amc/ | http://www.nicemice.net/amc/ | |||

B. Mixed-case annotation | B. Mixed-case annotation | |||

In order to use Punycode to represent case-insensitive strings, | In order to use Punycode to represent case-insensitive strings, | |||

higher layers need to case-fold the strings prior to Punycode | higher layers need to case-fold the strings prior to Punycode | |||

encoding. The encoded string can, however, use mixed case as an | encoding. The encoded string can use mixed case as an annotation | |||

annotation telling how to convert the original folded string into a | telling how to convert the folded string into a mixed-case string | |||

mixed-case string for display purposes. | for display purposes. Note, however, that mixed-case annotation | |||

is not used by the ToASCII and ToUnicode operations specified in | ||||

[IDNA], and therefore implementors of IDNA can disregard this | ||||

appendix. | ||||

Basic code points can use mixed case directly, because the decoder | Basic code points can use mixed case directly, because the decoder | |||

copies them verbatim, leaving lowercase code points lowercase, and | copies them verbatim, leaving lowercase code points lowercase, and | |||

leaving uppercase code points uppercase. Each non-basic code point | leaving uppercase code points uppercase. Each non-basic code point | |||

is represented by a delta, which is represented by a sequence of | is represented by a delta, which is represented by a sequence of | |||

basic code points, the last of which provides the annotation. If it | basic code points, the last of which provides the annotation. If it | |||

is uppercase, it is a suggestion to map the non-basic code point to | is uppercase, it is a suggestion to map the non-basic code point to | |||

uppercase (if possible); if it is lowercase, it is a suggestion to | uppercase (if possible); if it is lowercase, it is a suggestion to | |||

map the non-basic code point to lowercase (if possible). | map the non-basic code point to lowercase (if possible). | |||

Punycode encoders and decoders are not required to support these | These annotations do not alter the code points returned by decoders; | |||

annotations, and higher layers need not use them. | the annotations are returned separately, for the caller to use or | |||

ignore. Encoders can accept annotations in addition to code points, | ||||

but the annotations do not alter the output, except to influence the | ||||

uppercase/lowercase form of ASCII letters. | ||||

C. Disclaimer and License | Punycode encoders and decoders need not support these annotations, | |||

and higher layers need not use them. | ||||

C. Disclaimer and license | ||||

Regarding this entire document or any portion of it (including | Regarding this entire document or any portion of it (including | |||

the pseudocode and C code), the author makes no guarantees and | the pseudocode and C code), the author makes no guarantees and | |||

is not responsible for any damage resulting from its use. The | is not responsible for any damage resulting from its use. The | |||

author grants irrevocable permission to anyone to use, modify, | author grants irrevocable permission to anyone to use, modify, | |||

and distribute it in any way that does not diminish the rights | and distribute it in any way that does not diminish the rights | |||

of anyone else to use, modify, and distribute it, provided that | of anyone else to use, modify, and distribute it, provided that | |||

redistributed derivative works do not contain misleading author or | redistributed derivative works do not contain misleading author or | |||

version information. Derivative works need not be licensed under | version information. Derivative works need not be licensed under | |||

similar terms. | similar terms. | |||

D. Punycode sample implementation | D. Punycode sample implementation | |||

/* | /* | |||

punycode.c 0.4.0 (2001-Nov-17-Sat) | punycode.c from draft-ietf-idn-punycode-01 | |||

http://www.cs.berkeley.edu/~amc/idn/ | http://www.nicemice.net/idn/ | |||

Adam M. Costello | Adam M. Costello | |||

http://www.nicemice.net/amc/ | http://www.nicemice.net/amc/ | |||

*/ | ||||

/* This is ANSI C code (C89) implementing Punycode version 0.3.x. */ | This is ANSI C code (C89) implementing | |||

Punycode (draft-ietf-idn-punycode-01). | ||||

*/ | ||||

/************************************************************/ | /************************************************************/ | |||

/* Public interface (would normally go in its own .h file): */ | /* Public interface (would normally go in its own .h file): */ | |||

#include <limits.h> | #include <limits.h> | |||

enum punycode_status { | enum punycode_status { | |||

punycode_success, | punycode_success, | |||

punycode_bad_input, /* Input is invalid. */ | punycode_bad_input, /* Input is invalid. */ | |||

punycode_big_output, /* Output would exceed the space provided. */ | punycode_big_output, /* Output would exceed the space provided. */ | |||

punycode_overflow /* Input requires wider integers to process. */ | punycode_overflow /* Input needs wider integers to process. */ | |||

}; | }; | |||

#if UINT_MAX >= (1 << 26) - 1 | #if UINT_MAX >= (1 << 26) - 1 | |||

typedef unsigned int punycode_uint; | typedef unsigned int punycode_uint; | |||

#else | #else | |||

typedef unsigned long punycode_uint; | typedef unsigned long punycode_uint; | |||

#endif | #endif | |||

enum punycode_status punycode_encode( | enum punycode_status punycode_encode( | |||

punycode_uint input_length, | punycode_uint input_length, | |||

skipping to change at line 802 | skipping to change at line 986 | |||

#else | #else | |||

typedef unsigned long punycode_uint; | typedef unsigned long punycode_uint; | |||

#endif | #endif | |||

enum punycode_status punycode_encode( | enum punycode_status punycode_encode( | |||

punycode_uint input_length, | punycode_uint input_length, | |||

const punycode_uint input[], | const punycode_uint input[], | |||

const unsigned char case_flags[], | const unsigned char case_flags[], | |||

punycode_uint *output_length, | punycode_uint *output_length, | |||

char output[] ); | char output[] ); | |||

/* punycode_encode() converts Unicode to Punycode. The input */ | /* punycode_encode() converts Unicode to Punycode. The input */ | |||

/* must be represented as an array of Unicode code points (not */ | /* is represented as an array of Unicode code points (not code */ | |||

/* code units; surrogate pairs are not allowed), and the output */ | /* units; surrogate pairs are not allowed), and the output */ | |||

/* will be represented as an array of ASCII code points. The */ | /* will be represented as an array of ASCII code points. The */ | |||

/* output string is *not* null-terminated; it will contain */ | /* output string is *not* null-terminated; it will contain */ | |||

/* zeros if and only if the input contains zeros. (Of course */ | /* zeros if and only if the input contains zeros. (Of course */ | |||

/* the caller can leave room for a terminator and add one if */ | /* the caller can leave room for a terminator and add one if */ | |||

/* needed.) The input_length is the number of code points in the */ | /* needed.) The input_length is the number of code points in */ | |||

/* input. The output_length is an in/out argument: the caller */ | /* the input. The output_length is an in/out argument: the */ | |||

/* must pass in the maximum number of code points that may be */ | /* caller passes in the maximum number of code points that it */ | |||

/* output, and on successful return it will contain the number */ | /* can receive, and on successful return it will contain the */ | |||

/* of code points actually output. The case_flags array must */ | /* number of code points actually output. The case_flags array */ | |||

/* hold input_length boolean values, where nonzero means the */ | /* holds input_length boolean values, where nonzero suggests that */ | |||

/* corresponding Unicode character should be forced to uppercase */ | /* the corresponding Unicode character be forced to uppercase */ | |||

/* after being decoded (if possible), and zero means it should */ | /* after being decoded (if possible), and zero suggests that */ | |||

/* be forced to lowercase (if possible). ASCII code points */ | /* it be forced to lowercase (if possible). ASCII code points */ | |||

/* are encoded literally, except that ASCII letters are forced */ | /* are encoded literally, except that ASCII letters are forced */ | |||

/* to uppercase or lowercase according to the corresponding */ | /* to uppercase or lowercase according to the corresponding */ | |||

/* uppercase flags. If case_flags is a null pointer then ASCII */ | /* uppercase flags. If case_flags is a null pointer then ASCII */ | |||

/* letters are left as they are, and other code points are */ | /* letters are left as they are, and other code points are */ | |||

/* treated as if their uppercase flags were zero. The return */ | /* treated as if their uppercase flags were zero. The return */ | |||

/* value may be any of the punycode_status values defined above */ | /* value can be any of the punycode_status values defined above */ | |||

/* except punycode_bad_input; if not punycode_success, then */ | /* except punycode_bad_input; if not punycode_success, then */ | |||

/* output_size and output may contain garbage. */ | /* output_size and output might contain garbage. */ | |||

enum punycode_status punycode_decode( | enum punycode_status punycode_decode( | |||

punycode_uint input_length, | punycode_uint input_length, | |||

const char input[], | const char input[], | |||

punycode_uint *output_length, | punycode_uint *output_length, | |||

punycode_uint output[], | punycode_uint output[], | |||

unsigned char case_flags[] ); | unsigned char case_flags[] ); | |||

/* punycode_decode() converts Punycode to Unicode. The input */ | ||||

/* must be represented as an array of ASCII code points, and */ | /* punycode_decode() converts Punycode to Unicode. The input is */ | |||

/* the output will be represented as an array of Unicode code */ | /* represented as an array of ASCII code points, and the output */ | |||

/* points. The input_length is the number of code points in */ | /* will be represented as an array of Unicode code points. The */ | |||

/* the input. The output_length is an in/out argument: the */ | /* input_length is the number of code points in the input. The */ | |||

/* caller must pass in the maximum number of code points that */ | /* output_length is an in/out argument: the caller passes in */ | |||

/* may be output, and on successful return it will contain the */ | /* the maximum number of code points that it can receive, and */ | |||

/* actual number of code points output. The case_flags array */ | /* on successful return it will contain the actual number of */ | |||

/* must have room for at least output_length values, or it may */ | /* code points output. The case_flags array needs room for at */ | |||

/* be a null pointer if the case information is not needed. */ | /* least output_length values, or it can be a null pointer if the */ | |||

/* A nonzero flag indicates that the corresponding Unicode */ | /* case information is not needed. A nonzero flag suggests that */ | |||

/* character should be forced to uppercase by the caller (if */ | /* the corresponding Unicode character be forced to uppercase */ | |||

/* possible), while zero means it should be forced to lowercase */ | /* by the caller (if possible), while zero suggests that it be */ | |||

/* (if possible). ASCII code points are output already in the */ | /* forced to lowercase (if possible). ASCII code points are */ | |||

/* proper case, but their flags will be set appropriately so that */ | /* output already in the proper case, but their flags will be set */ | |||

/* applying the flags would be harmless. The return value may */ | /* appropriately so that applying the flags would be harmless. */ | |||

/* be any of the punycode_status values defined above; if not */ | /* The return value can be any of the punycode_status values */ | |||

/* punycode_success, then output_length, output, and case_flags */ | /* defined above; if not punycode_success, then output_length, */ | |||

/* may contain garbage. On success, the decoder will never need */ | /* output, and case_flags might contain garbage. On success, the */ | |||

/* to write an output_length greater than input_length, because */ | /* decoder will never need to write an output_length greater than */ | |||

/* of how the encoding is defined. */ | /* input_length, because of how the encoding is defined. */ | |||

/**********************************************************/ | /**********************************************************/ | |||

/* Implementation (would normally go in its own .c file): */ | /* Implementation (would normally go in its own .c file): */ | |||

#include <string.h> | #include <string.h> | |||

/*** Bootstring parameters for Punycode ***/ | /*** Bootstring parameters for Punycode ***/ | |||

enum { base = 36, tmin = 1, tmax = 26, skew = 38, damp = 700, | enum { base = 36, tmin = 1, tmax = 26, skew = 38, damp = 700, | |||

initial_bias = 72, initial_n = 0x80, delimiter = 0x2D }; | initial_bias = 72, initial_n = 0x80, delimiter = 0x2D }; | |||

skipping to change at line 883 | skipping to change at line 1066 | |||

/* point (for use in representing integers) in the range 0 to */ | /* point (for use in representing integers) in the range 0 to */ | |||

/* base-1, or base if cp is does not represent a value. */ | /* base-1, or base if cp is does not represent a value. */ | |||

static punycode_uint decode_digit(punycode_uint cp) | static punycode_uint decode_digit(punycode_uint cp) | |||

{ | { | |||

return cp - 48 < 10 ? cp - 22 : cp - 65 < 26 ? cp - 65 : | return cp - 48 < 10 ? cp - 22 : cp - 65 < 26 ? cp - 65 : | |||

cp - 97 < 26 ? cp - 97 : base; | cp - 97 < 26 ? cp - 97 : base; | |||

} | } | |||

/* encode_digit(d,flag) returns the basic code point whose value */ | /* encode_digit(d,flag) returns the basic code point whose value */ | |||

/* (when used for representing integers) is d, which must be in the */ | /* (when used for representing integers) is d, which needs to be in */ | |||

/* range 0 to base-1. The lowercase form is used unless flag is */ | /* the range 0 to base-1. The lowercase form is used unless flag is */ | |||

/* nonzero, in which case the uppercase form is used. The behavior */ | /* nonzero, in which case the uppercase form is used. The behavior */ | |||

/* is undefined if flag is nonzero and digit d has no uppercase form. */ | /* is undefined if flag is nonzero and digit d has no uppercase form. */ | |||

static char encode_digit(punycode_uint d, int flag) | static char encode_digit(punycode_uint d, int flag) | |||

{ | { | |||

return d + 22 + 75 * (d < 26) - ((flag != 0) << 5); | return d + 22 + 75 * (d < 26) - ((flag != 0) << 5); | |||

/* 0..25 map to ASCII a..z or A..Z */ | /* 0..25 map to ASCII a..z or A..Z */ | |||

/* 26..35 map to ASCII 0..9 */ | /* 26..35 map to ASCII 0..9 */ | |||

} | } | |||

skipping to change at line 1003 | skipping to change at line 1187 | |||

/* Punycode does not need to check whether input[j] is basic: */ | /* Punycode does not need to check whether input[j] is basic: */ | |||

if (input[j] < n /* || basic(input[j]) */ ) { | if (input[j] < n /* || basic(input[j]) */ ) { | |||

if (++delta == 0) return punycode_overflow; | if (++delta == 0) return punycode_overflow; | |||

} | } | |||

if (input[j] == n) { | if (input[j] == n) { | |||

/* Represent delta as a generalized variable-length integer: */ | /* Represent delta as a generalized variable-length integer: */ | |||

for (q = delta, k = base; ; k += base) { | for (q = delta, k = base; ; k += base) { | |||

if (out >= max_out) return punycode_big_output; | if (out >= max_out) return punycode_big_output; | |||

t = k <= bias ? tmin : k - bias >= tmax ? tmax : k - bias; | t = k <= bias /* + tmin */ ? tmin : /* +tmin not needed */ | |||

k >= bias + tmax ? tmax : k - bias; | ||||

if (q < t) break; | if (q < t) break; | |||

output[out++] = encode_digit(t + (q - t) % (base - t), 0); | output[out++] = encode_digit(t + (q - t) % (base - t), 0); | |||

q = (q - t) / (base - t); | q = (q - t) / (base - t); | |||

} | } | |||

output[out++] = encode_digit(q, case_flags && case_flags[j]); | output[out++] = encode_digit(q, case_flags && case_flags[j]); | |||

bias = adapt(delta, h + 1, h == b); | bias = adapt(delta, h + 1, h == b); | |||

delta = 0; | delta = 0; | |||

++h; | ++h; | |||

} | } | |||

skipping to change at line 1073 | skipping to change at line 1259 | |||

/* which gets added to i. The overflow checking is easier */ | /* which gets added to i. The overflow checking is easier */ | |||

/* if we increase i as we go, then subtract off its starting */ | /* if we increase i as we go, then subtract off its starting */ | |||

/* value at the end to obtain delta. */ | /* value at the end to obtain delta. */ | |||

for (oldi = i, w = 1, k = base; ; k += base) { | for (oldi = i, w = 1, k = base; ; k += base) { | |||

if (in >= input_length) return punycode_bad_input; | if (in >= input_length) return punycode_bad_input; | |||

digit = decode_digit(input[in++]); | digit = decode_digit(input[in++]); | |||

if (digit >= base) return punycode_bad_input; | if (digit >= base) return punycode_bad_input; | |||

if (digit > (maxint - i) / w) return punycode_overflow; | if (digit > (maxint - i) / w) return punycode_overflow; | |||

i += digit * w; | i += digit * w; | |||

t = k <= bias ? tmin : k - bias >= tmax ? tmax : k - bias; | t = k <= bias /* + tmin */ ? tmin : /* +tmin not needed */ | |||

k >= bias + tmax ? tmax : k - bias; | ||||

if (digit < t) break; | if (digit < t) break; | |||

if (w > maxint / (base - t)) return punycode_overflow; | if (w > maxint / (base - t)) return punycode_overflow; | |||

w *= (base - t); | w *= (base - t); | |||

} | } | |||

bias = adapt(i - oldi, out + 1, oldi == 0); | bias = adapt(i - oldi, out + 1, oldi == 0); | |||

/* i was supposed to wrap around from out+1 to 0, */ | /* i was supposed to wrap around from out+1 to 0, */ | |||

/* incrementing n each time, so we'll fix that now: */ | /* incrementing n each time, so we'll fix that now: */ | |||

if (i / (out + 1) > maxint - n) return punycode_overflow; | if (i / (out + 1) > maxint - n) return punycode_overflow; | |||

n += i / (out + 1); | n += i / (out + 1); | |||

i %= (out + 1); | i %= (out + 1); | |||

/* Insert n at position i of the output: */ | /* Insert n at position i of the output: */ | |||

skipping to change at line 1135 | skipping to change at line 1322 | |||

{ | { | |||

fprintf(stderr, | fprintf(stderr, | |||

"\n" | "\n" | |||

"%s -e reads code points and writes a Punycode string.\n" | "%s -e reads code points and writes a Punycode string.\n" | |||

"%s -d reads a Punycode string and writes code points.\n" | "%s -d reads a Punycode string and writes code points.\n" | |||

"\n" | "\n" | |||

"Input and output are plain text in the native character set.\n" | "Input and output are plain text in the native character set.\n" | |||

"Code points are in the form u+hex separated by whitespace.\n" | "Code points are in the form u+hex separated by whitespace.\n" | |||

"Although the specification allows Punycode strings to contain\n" | "Although the specification allows Punycode strings to contain\n" | |||

"any characters from the ASCII repertoire, this test code\n" | "any characters from the ASCII repertoire, this test code\n" | |||

"supports only the printable characters, and requires the\n" | "supports only the printable characters, and needs the Punycode\n" | |||

"Punycode string to be followed by a newline.\n" | "string to be followed by a newline.\n" | |||

"The case of the u in u+hex is the force-to-uppercase flag.\n" | "The case of the u in u+hex is the force-to-uppercase flag.\n" | |||

, argv[0], argv[0]); | , argv[0], argv[0]); | |||

exit(EXIT_FAILURE); | exit(EXIT_FAILURE); | |||

} | } | |||

static void fail(const char *msg) | static void fail(const char *msg) | |||

{ | { | |||

fputs(msg,stderr); | fputs(msg,stderr); | |||

exit(EXIT_FAILURE); | exit(EXIT_FAILURE); | |||

} | } | |||

skipping to change at line 1274 | skipping to change at line 1461 | |||

if (r < 0) fail(io_error); | if (r < 0) fail(io_error); | |||

} | } | |||

return EXIT_SUCCESS; | return EXIT_SUCCESS; | |||

} | } | |||

usage(argv); | usage(argv); | |||

return EXIT_SUCCESS; /* not reached, but quiets compiler warning */ | return EXIT_SUCCESS; /* not reached, but quiets compiler warning */ | |||

} | } | |||

INTERNET-DRAFT expires 2002-Jul-06 | INTERNET-DRAFT expires 2002-Aug-24 | |||

End of changes. 65 change blocks. | ||||

176 lines changed or deleted | | 359 lines changed or added | ||

This html diff was produced by rfcdiff 1.33. The latest version is available from http://tools.ietf.org/tools/rfcdiff/ |