draft-ietf-idn-punycode-02.txt | rfc3492.txt | |||
---|---|---|---|---|

INTERNET-DRAFT Adam M. Costello | ||||

draft-ietf-idn-punycode-02.txt 2002-May-23 | ||||

Expires 2002-Nov-23 | ||||

Punycode: An encoding of Unicode for use with IDNA | Network Working Group A. Costello | |||

Request for Comments: 3492 Univ. of California, Berkeley | ||||

Punycode: A Bootstring encoding of Unicode | ||||

for Internationalized Domain Names in Applications (IDNA) | ||||

Status of this Memo | Status of this Memo | |||

This document is an Internet-Draft and is in full conformance with | This document specifies an Internet standards track protocol for the | |||

all provisions of Section 10 of RFC2026. | Internet community, and requests discussion and suggestions for | |||

improvements. Please refer to the current edition of the "Internet | ||||

Internet-Drafts are working documents of the Internet Engineering | Official Protocol Standards" (STD 1) for the standardization state | |||

Task Force (IETF), its areas, and its working groups. Note | and status of this protocol. Distribution of this memo is unlimited. | |||

that other groups may also distribute working documents as | ||||

Internet-Drafts. | ||||

Internet-Drafts are draft documents valid for a maximum of six | ||||

months and may be updated, replaced, or obsoleted by other documents | ||||

at any time. It is inappropriate to use Internet-Drafts as | ||||

reference material or to cite them other than as "work in progress." | ||||

The list of current Internet-Drafts can be accessed at | ||||

http://www.ietf.org/ietf/1id-abstracts.txt | ||||

The list of Internet-Draft Shadow Directories can be accessed at | Copyright Notice | |||

http://www.ietf.org/shadow.html | ||||

Distribution of this document is unlimited. | Copyright (C) The Internet Society (2003). All Rights Reserved. | |||

Abstract | Abstract | |||

Punycode is a simple and efficient transfer encoding syntax designed | Punycode is a simple and efficient transfer encoding syntax designed | |||

for use with Internationalized Domain Names in Applications [IDNA]. | for use with Internationalized Domain Names in Applications (IDNA). | |||

It uniquely and reversibly transforms a Unicode string [UNICODE] | It uniquely and reversibly transforms a Unicode string into an ASCII | |||

into an ASCII string [ASCII]. ASCII characters in the Unicode | string. ASCII characters in the Unicode string are represented | |||

string are represented literally, and non-ASCII characters are | literally, and non-ASCII characters are represented by ASCII | |||

represented by ASCII characters that are allowed in host name labels | characters that are allowed in host name labels (letters, digits, and | |||

(letters, digits, and hyphens). This document defines a general | hyphens). This document defines a general algorithm called | |||

algorithm called Bootstring that allows a string of basic code | Bootstring that allows a string of basic code points to uniquely | |||

points to uniquely represent any string of code points drawn from | represent any string of code points drawn from a larger set. | |||

a larger set. Punycode is an instance of Bootstring that uses | Punycode is an instance of Bootstring that uses particular parameter | |||

particular parameter values specified by this document, appropriate | values specified by this document, appropriate for IDNA. | |||

for IDNA. | ||||

Contents | Table of Contents | |||

1. Introduction | 1. Introduction...............................................2 | |||

1.1 Features | 1.1 Features..............................................2 | |||

1.2 Interaction of protocol parts | 1.2 Interaction of protocol parts.........................3 | |||

2. Terminology | 2. Terminology................................................3 | |||

3. Bootstring description | 3. Bootstring description.....................................4 | |||

3.1 Basic code point segregation | 3.1 Basic code point segregation..........................4 | |||

3.2 Insertion unsort coding | 3.2 Insertion unsort coding...............................4 | |||

3.3 Generalized variable-length integers | 3.3 Generalized variable-length integers..................5 | |||

3.4 Bias adaptation | 3.4 Bias adaptation.......................................7 | |||

4. Bootstring parameters | 4. Bootstring parameters......................................8 | |||

5. Parameter values for Punycode | 5. Parameter values for Punycode..............................8 | |||

6. Bootstring algorithms | 6. Bootstring algorithms......................................9 | |||

6.1 Bias adaptation function | 6.1 Bias adaptation function.............................10 | |||

6.2 Decoding procedure | 6.2 Decoding procedure...................................11 | |||

6.3 Encoding procedure | 6.3 Encoding procedure...................................12 | |||

6.4 Overflow handling | 6.4 Overflow handling....................................13 | |||

7. Punycode examples | 7. Punycode examples.........................................14 | |||

7.1 Sample strings | 7.1 Sample strings.......................................14 | |||

7.2 Decoding traces | 7.2 Decoding traces......................................17 | |||

7.3 Encoding traces | 7.3 Encoding traces......................................19 | |||

8. Security considerations | 8. Security Considerations...................................20 | |||

9. References (non-normative) | 9. References................................................21 | |||

A. Author contact information | 9.1 Normative References.................................21 | |||

B. Mixed-case annotation | 9.2 Informative References...............................21 | |||

C. Disclaimer and license | A. Mixed-case annotation.....................................22 | |||

D. Punycode sample implementation | B. Disclaimer and license....................................22 | |||

C. Punycode sample implementation............................23 | ||||

Author's Address.............................................34 | ||||

Full Copyright Statement.....................................35 | ||||

1. Introduction | 1. Introduction | |||

[IDNA] describes an architecture for supporting internationalized | [IDNA] describes an architecture for supporting internationalized | |||

domain names. Labels containing non-ASCII characters can be | domain names. Labels containing non-ASCII characters can be | |||

represented by ACE labels, which begin with a special ACE prefix and | represented by ACE labels, which begin with a special ACE prefix and | |||

contain only ASCII characters. The remainder of the label after the | contain only ASCII characters. The remainder of the label after the | |||

prefix is a Punycode encoding of a Unicode string satisfying certain | prefix is a Punycode encoding of a Unicode string satisfying certain | |||

constraints. For the details of the prefix and constraints, see | constraints. For the details of the prefix and constraints, see | |||

[IDNA] and [NAMEPREP]. | [IDNA] and [NAMEPREP]. | |||

Punycode is an instance of a more general algorithm called | ||||

Bootstring, which allows strings composed from a small set of "basic" | ||||

code points to uniquely represent any string of code points drawn | ||||

from a larger set. Punycode is Bootstring with particular parameter | ||||

values appropriate for IDNA. | ||||

1.1 Features | 1.1 Features | |||

Bootstring has been designed to have the following features: | Bootstring has been designed to have the following features: | |||

* Completeness: Every extended string (sequence of arbitrary code | * Completeness: Every extended string (sequence of arbitrary code | |||

points) can be represented by a basic string (sequence of basic | points) can be represented by a basic string (sequence of basic | |||

code points). Restrictions on what strings are allowed, and on | code points). Restrictions on what strings are allowed, and on | |||

length, can be imposed by higher layers. | length, can be imposed by higher layers. | |||

* Uniqueness: There is at most one basic string that represents a | * Uniqueness: There is at most one basic string that represents a | |||

given extended string. | given extended string. | |||

* Reversibility: Any extended string mapped to a basic string can | * Reversibility: Any extended string mapped to a basic string can | |||

be recovered from that basic string. | be recovered from that basic string. | |||

* Efficient encoding: The ratio of basic string length to | * Efficient encoding: The ratio of basic string length to extended | |||

extended string length is small. This is important in the | string length is small. This is important in the context of | |||

context of domain names because RFC 1034 [RFC1034] restricts the | domain names because RFC 1034 [RFC1034] restricts the length of a | |||

length of a domain label to 63 characters. | domain label to 63 characters. | |||

* Simplicity: The encoding and decoding algorithms are reasonably | * Simplicity: The encoding and decoding algorithms are reasonably | |||

simple to implement. The goals of efficiency and simplicity are | simple to implement. The goals of efficiency and simplicity are | |||

at odds; Bootstring aims at a good balance between them. | at odds; Bootstring aims at a good balance between them. | |||

* Readability: Basic code points appearing in the extended string | * Readability: Basic code points appearing in the extended string | |||

are represented as themselves in the basic string (although the | are represented as themselves in the basic string (although the | |||

main purpose is to improve efficiency, not readability). | main purpose is to improve efficiency, not readability). | |||

Punycode can also support an additional feature that is not used | Punycode can also support an additional feature that is not used by | |||

by the ToASCII and ToUnicode operations of [IDNA]. When extended | the ToASCII and ToUnicode operations of [IDNA]. When extended | |||

strings are case-folded prior to encoding, the basic string can | strings are case-folded prior to encoding, the basic string can use | |||

use mixed case to tell how to convert the folded string into a | mixed case to tell how to convert the folded string into a mixed-case | |||

mixed-case string. See appendix B "Mixed-case annotation". | string. See appendix A "Mixed-case annotation". | |||

1.2 Interaction of protocol parts | 1.2 Interaction of protocol parts | |||

Punycode is used by the IDNA protocol [IDNA] for converting domain | Punycode is used by the IDNA protocol [IDNA] for converting domain | |||

labels into ASCII; it is not designed for any other purpose. It is | labels into ASCII; it is not designed for any other purpose. It is | |||

explicitly not designed for processing arbitrary free text. | explicitly not designed for processing arbitrary free text. | |||

2. Terminology | 2. Terminology | |||

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", | The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", | |||

"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this | "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this | |||

document are to be interpreted as described in RFC 2119 [RFC2119]. | document are to be interpreted as described in BCP 14, RFC 2119 | |||

[RFC2119]. | ||||

A code point is an integral value associated with a character in a | A code point is an integral value associated with a character in a | |||

coded character set. | coded character set. | |||

As in the Unicode Standard [UNICODE], Unicode code points are | As in the Unicode Standard [UNICODE], Unicode code points are denoted | |||

denoted by "U+" followed by four to six hexadecimal digits, while a | by "U+" followed by four to six hexadecimal digits, while a range of | |||

range of code points is denoted by two hexadecimal numbers separated | code points is denoted by two hexadecimal numbers separated by "..", | |||

by "..", with no prefixes. | with no prefixes. | |||

The operators div and mod perform integer division; (x div y) is the | The operators div and mod perform integer division; (x div y) is the | |||

quotient of x divided by y, discarding the remainder, and (x mod y) | quotient of x divided by y, discarding the remainder, and (x mod y) | |||

is the remainder, so (x div y) * y + (x mod y) == x. Bootstring | is the remainder, so (x div y) * y + (x mod y) == x. Bootstring uses | |||

uses these operators only with nonnegative operands, so the quotient | these operators only with nonnegative operands, so the quotient and | |||

and remainder are always nonnegative. | remainder are always nonnegative. | |||

The break statement jumps out of the innermost loop (as in C). | The break statement jumps out of the innermost loop (as in C). | |||

An overflow is an attempt to compute a value that exceeds the | An overflow is an attempt to compute a value that exceeds the maximum | |||

maximum value of an integer variable. | value of an integer variable. | |||

3. Bootstring description | 3. Bootstring description | |||

Bootstring represents an arbitrary sequence of code points (the | Bootstring represents an arbitrary sequence of code points (the | |||

"extended string") as a sequence of basic code points (the "basic | "extended string") as a sequence of basic code points (the "basic | |||

string"). This section describes the representation. Section 6 | string"). This section describes the representation. Section 6 | |||

"Bootstring algorithms" presents the algorithms as pseudocode. | "Bootstring algorithms" presents the algorithms as pseudocode. | |||

Sections 7.1 "Decoding traces" and 7.2 "Encoding traces" trace the | Sections 7.1 "Decoding traces" and 7.2 "Encoding traces" trace the | |||

algorithms for sample inputs. | algorithms for sample inputs. | |||

The following sections describe the four techniques used in | The following sections describe the four techniques used in | |||

Bootstring. "Basic code point segregation" is a very simple | Bootstring. "Basic code point segregation" is a very simple and | |||

and efficient encoding for basic code points occurring in the | efficient encoding for basic code points occurring in the extended | |||

extended string: they are simply copied all at once. "Insertion | string: they are simply copied all at once. "Insertion unsort | |||

unsort coding" encodes the non-basic code points as deltas, and | coding" encodes the non-basic code points as deltas, and processes | |||

processes the code points in numerical order rather than in order of | the code points in numerical order rather than in order of | |||

appearance, which typically results in smaller deltas. The deltas | appearance, which typically results in smaller deltas. The deltas | |||

are represented as "generalized variable-length integers", which use | are represented as "generalized variable-length integers", which use | |||

basic code points to represent nonnegative integers. The parameters | basic code points to represent nonnegative integers. The parameters | |||

of this integer representation are dynamically adjusted using "bias | of this integer representation are dynamically adjusted using "bias | |||

adaptation", to improve efficiency when consecutive deltas have | adaptation", to improve efficiency when consecutive deltas have | |||

similar magnitudes. | similar magnitudes. | |||

3.1 Basic code point segregation | 3.1 Basic code point segregation | |||

All basic code points appearing in the extended string are | All basic code points appearing in the extended string are | |||

represented literally at the beginning of the basic string, in their | represented literally at the beginning of the basic string, in their | |||

original order, followed by a delimiter if (and only if) the number | original order, followed by a delimiter if (and only if) the number | |||

of basic code points is nonzero. The delimiter is a particular | of basic code points is nonzero. The delimiter is a particular basic | |||

basic code point, which never appears in the remainder of the basic | code point, which never appears in the remainder of the basic string. | |||

string. The decoder can therefore find the end of the literal | The decoder can therefore find the end of the literal portion (if | |||

portion (if there is one) by scanning for the last delimiter. | there is one) by scanning for the last delimiter. | |||

3.2 Insertion unsort coding | 3.2 Insertion unsort coding | |||

The remainder of the basic string (after the last delimiter if there | The remainder of the basic string (after the last delimiter if there | |||

is one) represents a sequence of nonnegative integral deltas as | is one) represents a sequence of nonnegative integral deltas as | |||

generalized variable-length integers, described in section 3.3. The | generalized variable-length integers, described in section 3.3. The | |||

meaning of the deltas is best understood in terms of the decoder. | meaning of the deltas is best understood in terms of the decoder. | |||

The decoder builds the extended string incrementally. Initially, | The decoder builds the extended string incrementally. Initially, the | |||

the extended string is a copy of the literal portion of the basic | extended string is a copy of the literal portion of the basic string | |||

string (excluding the last delimiter). The decoder inserts | (excluding the last delimiter). The decoder inserts non-basic code | |||

non-basic code points, one for each delta, into the extended string, | points, one for each delta, into the extended string, ultimately | |||

ultimately arriving at the final decoded string. | arriving at the final decoded string. | |||

At the heart of this process is a state machine with two state | At the heart of this process is a state machine with two state | |||

variables: an index i and a counter n. The index i refers to | variables: an index i and a counter n. The index i refers to a | |||

a position in the extended string; it ranges from 0 (the first | position in the extended string; it ranges from 0 (the first | |||

position) to the current length of the extended string (which refers | position) to the current length of the extended string (which refers | |||

to a potential position beyond the current end). If the current | to a potential position beyond the current end). If the current | |||

state is <n,i>, the next state is <n,i+1> if i is less than the | state is <n,i>, the next state is <n,i+1> if i is less than the | |||

length of the extended string, or <n+1,0> if i equals the length of | length of the extended string, or <n+1,0> if i equals the length of | |||

the extended string. In other words, each state change causes i to | the extended string. In other words, each state change causes i to | |||

increment, wrapping around to zero if necessary, and n counts the | increment, wrapping around to zero if necessary, and n counts the | |||

number of wrap-arounds. | number of wrap-arounds. | |||

Notice that the state always advances monotonically (there is no | Notice that the state always advances monotonically (there is no way | |||

way for the decoder to return to an earlier state). At each state, | for the decoder to return to an earlier state). At each state, an | |||

an insertion is either performed or not performed. At most one | insertion is either performed or not performed. At most one | |||

insertion is performed in a given state. An insertion inserts the | insertion is performed in a given state. An insertion inserts the | |||

value of n at position i in the extended string. The deltas are | value of n at position i in the extended string. The deltas are a | |||

a run-length encoding of this sequence of events: they are the | run-length encoding of this sequence of events: they are the lengths | |||

lengths of the runs of non-insertion states preceeding the insertion | of the runs of non-insertion states preceeding the insertion states. | |||

states. Hence, for each delta, the decoder performs delta state | Hence, for each delta, the decoder performs delta state changes, then | |||

changes, then an insertion, and then one more state change. (An | an insertion, and then one more state change. (An implementation | |||

implementation need not perform each state change individually, but | need not perform each state change individually, but can instead use | |||

can instead use division and remainder calculations to compute the | division and remainder calculations to compute the next insertion | |||

next insertion state directly.) It is an error if the inserted code | state directly.) It is an error if the inserted code point is a | |||

point is a basic code point (because basic code points were supposed | basic code point (because basic code points were supposed to be | |||

to be segregated as described in section 3.1). | segregated as described in section 3.1). | |||

The encoder's main task is to derive the sequence of deltas that | The encoder's main task is to derive the sequence of deltas that will | |||

will cause the decoder to construct the desired string. It can do | cause the decoder to construct the desired string. It can do this by | |||

this by repeatedly scanning the extended string for the next code | repeatedly scanning the extended string for the next code point that | |||

point that the decoder would need to insert, and counting the number | the decoder would need to insert, and counting the number of state | |||

of state changes the decoder would need to perform, mindful of the | changes the decoder would need to perform, mindful of the fact that | |||

fact that the decoder's extended string will include only those | the decoder's extended string will include only those code points | |||

code points that have already been inserted. Section 6.3 "Encoding | that have already been inserted. Section 6.3 "Encoding procedure" | |||

procedure" gives a precise algorithm. | gives a precise algorithm. | |||

3.3 Generalized variable-length integers | 3.3 Generalized variable-length integers | |||

In a conventional integer representation the base is the number of | In a conventional integer representation the base is the number of | |||

distinct symbols for digits, whose values are 0 through base-1. Let | distinct symbols for digits, whose values are 0 through base-1. Let | |||

digit_0 denote the least significant digit, digit_1 the next least | digit_0 denote the least significant digit, digit_1 the next least | |||

significant, and so on. The value represented is the sum over j of | significant, and so on. The value represented is the sum over j of | |||

digit_j * w(j), where w(j) = base^j is the weight (scale factor) | digit_j * w(j), where w(j) = base^j is the weight (scale factor) for | |||

for position j. For example, in the base 8 integer 437, the digits | position j. For example, in the base 8 integer 437, the digits are | |||

are 7, 3, and 4, and the weights are 1, 8, and 64, so the value is | 7, 3, and 4, and the weights are 1, 8, and 64, so the value is 7 + | |||

7 + 3*8 + 4*64 = 287. This representation has two disadvantages: | 3*8 + 4*64 = 287. This representation has two disadvantages: First, | |||

First, there are multiple encodings of each value (because there | there are multiple encodings of each value (because there can be | |||

can be extra zeros in the most significant positions), which is | extra zeros in the most significant positions), which is inconvenient | |||

inconvenient when unique encodings are needed. Second, the integer | when unique encodings are needed. Second, the integer is not self- | |||

is not self-delimiting, so if multiple integers are concatenated the | delimiting, so if multiple integers are concatenated the boundaries | |||

boundaries between them are lost. | between them are lost. | |||

The generalized variable-length representation solves these two | The generalized variable-length representation solves these two | |||

problems. The digit values are still 0 through base-1, but now | problems. The digit values are still 0 through base-1, but now the | |||

the integer is self-delimiting by means of thresholds t(j), each | integer is self-delimiting by means of thresholds t(j), each of which | |||

of which is in the range 0 through base-1. Exactly one digit, the | is in the range 0 through base-1. Exactly one digit, the most | |||

most significant, satisfies digit_j < t(j). Therefore, if several | significant, satisfies digit_j < t(j). Therefore, if several | |||

integers are concatenated, it is easy to separate them, starting | integers are concatenated, it is easy to separate them, starting with | |||

with the first if they are little-endian (least significant digit | the first if they are little-endian (least significant digit first), | |||

first), or starting with the last if they are big-endian (most | or starting with the last if they are big-endian (most significant | |||

significant digit first). As before, the value is the sum over j of | digit first). As before, the value is the sum over j of digit_j * | |||

digit_j * w(j), but the weights are different: | w(j), but the weights are different: | |||

w(0) = 1 | w(0) = 1 | |||

w(j) = w(j-1) * (base - t(j-1)) for j > 0 | w(j) = w(j-1) * (base - t(j-1)) for j > 0 | |||

For example, consider the little-endian sequence of base 8 digits | For example, consider the little-endian sequence of base 8 digits | |||

734251... Suppose the thresholds are 2, 3, 5, 5, 5, 5... This | 734251... Suppose the thresholds are 2, 3, 5, 5, 5, 5... This | |||

implies that the weights are 1, 1*(8-2) = 6, 6*(8-3) = 30, 30*(8-5) | implies that the weights are 1, 1*(8-2) = 6, 6*(8-3) = 30, 30*(8-5) = | |||

= 90, 90*(8-5) = 270, and so on. 7 is not less than 2, and 3 is | 90, 90*(8-5) = 270, and so on. 7 is not less than 2, and 3 is not | |||

not less than 3, but 4 is less than 5, so 4 is the last digit. The | less than 3, but 4 is less than 5, so 4 is the last digit. The value | |||

value of 734 is 7*1 + 3*6 + 4*30 = 145. The next integer is 251, | of 734 is 7*1 + 3*6 + 4*30 = 145. The next integer is 251, with | |||

with value 2*1 + 5*6 + 1*30 = 62. Decoding this representation | value 2*1 + 5*6 + 1*30 = 62. Decoding this representation is very | |||

is very similar to decoding a conventional integer: Start with a | similar to decoding a conventional integer: Start with a current | |||

current value of N = 0 and a weight w = 1. Fetch the next digit d | value of N = 0 and a weight w = 1. Fetch the next digit d and | |||

and increase N by d * w. If d is less than the current threshold | increase N by d * w. If d is less than the current threshold (t) | |||

(t) then stop, otherwise increase w by a factor of (base - t), | then stop, otherwise increase w by a factor of (base - t), update t | |||

update t for the next position, and repeat. | for the next position, and repeat. | |||

Encoding this representation is similar to encoding a conventional | Encoding this representation is similar to encoding a conventional | |||

integer: If N < t then output one digit for N and stop, otherwise | integer: If N < t then output one digit for N and stop, otherwise | |||

output the digit for t + ((N - t) mod (base - t)), then replace N | output the digit for t + ((N - t) mod (base - t)), then replace N | |||

with (N - t) div (base - t), update t for the next position, and | with (N - t) div (base - t), update t for the next position, and | |||

repeat. | repeat. | |||

For any particular set of values of t(j), there is exactly one | For any particular set of values of t(j), there is exactly one | |||

generalized variable-length representation of each nonnegative | generalized variable-length representation of each nonnegative | |||

integral value. | integral value. | |||

skipping to change at line 288 | skipping to change at page 7, line 4 | |||

generalized variable-length representation of each nonnegative | generalized variable-length representation of each nonnegative | |||

integral value. | integral value. | |||

Bootstring uses little-endian ordering so that the deltas can be | Bootstring uses little-endian ordering so that the deltas can be | |||

separated starting with the first. The t(j) values are defined in | separated starting with the first. The t(j) values are defined in | |||

terms of the constants base, tmin, and tmax, and a state variable | terms of the constants base, tmin, and tmax, and a state variable | |||

called bias: | called bias: | |||

t(j) = base * (j + 1) - bias, | t(j) = base * (j + 1) - bias, | |||

clamped to the range tmin through tmax | clamped to the range tmin through tmax | |||

The clamping means that if the formula yields a value less than tmin | The clamping means that if the formula yields a value less than tmin | |||

or greater than tmax, then t(j) = tmin or tmax, respectively. (In | or greater than tmax, then t(j) = tmin or tmax, respectively. (In | |||

the pseudocode in section 6 "Bootstring algorithms", the expression | the pseudocode in section 6 "Bootstring algorithms", the expression | |||

base * (j + 1) is denoted by k for performance reasons.) These | base * (j + 1) is denoted by k for performance reasons.) These t(j) | |||

t(j) values cause the representation to favor integers within a | values cause the representation to favor integers within a particular | |||

particular range determined by the bias. | range determined by the bias. | |||

3.4 Bias adaptation | 3.4 Bias adaptation | |||

After each delta is encoded or decoded, bias is set for the next | After each delta is encoded or decoded, bias is set for the next | |||

delta as follows: | delta as follows: | |||

1. Delta is scaled in order to avoid overflow in the next step: | 1. Delta is scaled in order to avoid overflow in the next step: | |||

let delta = delta div 2 | let delta = delta div 2 | |||

But when this is the very first delta, the divisor is not 2, but | But when this is the very first delta, the divisor is not 2, but | |||

instead a constant called damp. This compensates for the fact | instead a constant called damp. This compensates for the fact | |||

that the second delta is usually much smaller than the first. | that the second delta is usually much smaller than the first. | |||

2. Delta is increased to compensate for the fact that the next | 2. Delta is increased to compensate for the fact that the next delta | |||

delta will be inserting into a longer string: | will be inserting into a longer string: | |||

let delta = delta + (delta div numpoints) | let delta = delta + (delta div numpoints) | |||

numpoints is the total number of code points encoded/decoded so | numpoints is the total number of code points encoded/decoded so | |||

far (including the one corresponding to this delta itself, and | far (including the one corresponding to this delta itself, and | |||

including the basic code points). | including the basic code points). | |||

3. Delta is repeatedly divided until it falls within a threshold, | 3. Delta is repeatedly divided until it falls within a threshold, to | |||

to predict the minimum number of digits needed to represent the | predict the minimum number of digits needed to represent the next | |||

next delta: | delta: | |||

while delta > ((base - tmin) * tmax) div 2 | while delta > ((base - tmin) * tmax) div 2 | |||

do let delta = delta div (base - tmin) | do let delta = delta div (base - tmin) | |||

4. The bias is set: | 4. The bias is set: | |||

let bias = | let bias = | |||

(base * the number of divisions performed in step 3) + | (base * the number of divisions performed in step 3) + | |||

(((base - tmin + 1) * delta) div (delta + skew)) | (((base - tmin + 1) * delta) div (delta + skew)) | |||

The motivation for this procedure is that the current delta provides | The motivation for this procedure is that the current delta | |||

a hint about the likely size of the next delta, and so t(j) is | provides a hint about the likely size of the next delta, and so | |||

set to tmax for the more significant digits starting with the one | t(j) is set to tmax for the more significant digits starting with | |||

expected to be last, tmin for the less significant digits up through | the one expected to be last, tmin for the less significant digits | |||

the one expected to be third-last, and somewhere between tmin and | up through the one expected to be third-last, and somewhere | |||

tmax for the digit expected to be second-last (balancing the hope of | between tmin and tmax for the digit expected to be second-last | |||

the expected-last digit being unnecessary against the danger of it | (balancing the hope of the expected-last digit being unnecessary | |||

being insufficient). | against the danger of it being insufficient). | |||

4. Bootstring parameters | 4. Bootstring parameters | |||

Given a set of basic code points, one needs to be designated as | Given a set of basic code points, one needs to be designated as the | |||

the delimiter. The base cannot be greater than the number of | delimiter. The base cannot be greater than the number of | |||

distinguishable basic code points remaining. The digit-values in | distinguishable basic code points remaining. The digit-values in the | |||

the range 0 through base-1 need to be associated with distinct | range 0 through base-1 need to be associated with distinct non- | |||

non-delimiter basic code points. In some cases multiple code points | delimiter basic code points. In some cases multiple code points need | |||

need to have the same digit-value; for example, uppercase and | to have the same digit-value; for example, uppercase and lowercase | |||

lowercase versions of the same letter need to be equivalent if basic | versions of the same letter need to be equivalent if basic strings | |||

strings are case-insensitive. | are case-insensitive. | |||

The initial value of n cannot be greater than the minimum non-basic | The initial value of n cannot be greater than the minimum non-basic | |||

code point that could appear in extended strings. | code point that could appear in extended strings. | |||

The remaining five parameters (tmin, tmax, skew, damp, and the | The remaining five parameters (tmin, tmax, skew, damp, and the | |||

initial value of bias) need to satisfy the following constraints: | initial value of bias) need to satisfy the following constraints: | |||

0 <= tmin <= tmax <= base-1 | 0 <= tmin <= tmax <= base-1 | |||

skew >= 1 | skew >= 1 | |||

damp >= 2 | damp >= 2 | |||

initial_bias mod base <= base - tmin | initial_bias mod base <= base - tmin | |||

Provided the constraints are satisfied, these five parameters affect | Provided the constraints are satisfied, these five parameters affect | |||

efficiency but not correctness. They are best chosen empirically. | efficiency but not correctness. They are best chosen empirically. | |||

If support for mixed-case annotation is desired (see appendix B), | If support for mixed-case annotation is desired (see appendix A), | |||

make sure that the code points corresponding to 0 through tmax-1 all | make sure that the code points corresponding to 0 through tmax-1 all | |||

have both uppercase and lowercase forms. | have both uppercase and lowercase forms. | |||

5. Parameter values for Punycode | 5. Parameter values for Punycode | |||

Punycode uses the following Bootstring parameter values: | Punycode uses the following Bootstring parameter values: | |||

base = 36 | base = 36 | |||

tmin = 1 | tmin = 1 | |||

tmax = 26 | tmax = 26 | |||

skew = 38 | skew = 38 | |||

damp = 700 | damp = 700 | |||

initial_bias = 72 | initial_bias = 72 | |||

initial_n = 128 = 0x80 | initial_n = 128 = 0x80 | |||

Although the only restriction Punycode imposes on the input integers | Although the only restriction Punycode imposes on the input integers | |||

is that they be nonnegative, these parameters are especially | is that they be nonnegative, these parameters are especially designed | |||

designed to work well with Unicode [UNICODE] code points, which | to work well with Unicode [UNICODE] code points, which are integers | |||

are integers in the range 0..10FFFF (but not D800..DFFF, which are | in the range 0..10FFFF (but not D800..DFFF, which are reserved for | |||

reserved for use by the UTF-16 encoding of Unicode). The basic code | use by the UTF-16 encoding of Unicode). The basic code points are | |||

points are the ASCII [ASCII] code points (0..7F), of which U+002D | the ASCII [ASCII] code points (0..7F), of which U+002D (-) is the | |||

(-) is the delimiter, and some of the others have digit-values as | delimiter, and some of the others have digit-values as follows: | |||

follows: | ||||

code points digit-values | code points digit-values | |||

------------ ---------------------- | ------------ ---------------------- | |||

41..5A (A-Z) = 0 to 25, respectively | 41..5A (A-Z) = 0 to 25, respectively | |||

61..7A (a-z) = 0 to 25, respectively | 61..7A (a-z) = 0 to 25, respectively | |||

30..39 (0-9) = 26 to 35, respectively | 30..39 (0-9) = 26 to 35, respectively | |||

Using hyphen-minus as the delimiter implies that the encoded string | Using hyphen-minus as the delimiter implies that the encoded string | |||

can end with a hyphen-minus only if the Unicode string consists | can end with a hyphen-minus only if the Unicode string consists | |||

entirely of basic code points, but IDNA forbids such strings from | entirely of basic code points, but IDNA forbids such strings from | |||

being encoded. The encoded string can begin with a hyphen-minus, | being encoded. The encoded string can begin with a hyphen-minus, but | |||

but IDNA prepends a prefix. Therefore IDNA using Punycode conforms | IDNA prepends a prefix. Therefore IDNA using Punycode conforms to | |||

to the RFC 952 rule that host name labels neither begin nor end with | the RFC 952 rule that host name labels neither begin nor end with a | |||

a hyphen-minus [RFC952]. | hyphen-minus [RFC952]. | |||

A decoder MUST recognize the letters in both uppercase and lowercase | A decoder MUST recognize the letters in both uppercase and lowercase | |||

forms (including mixtures of both forms). An encoder SHOULD output | forms (including mixtures of both forms). An encoder SHOULD output | |||

only uppercase forms or only lowercase forms, unless it uses | only uppercase forms or only lowercase forms, unless it uses mixed- | |||

mixed-case annotation (see appendix B). | case annotation (see appendix A). | |||

Presumably most users will not manually write or type encoded | Presumably most users will not manually write or type encoded strings | |||

strings (as opposed to cutting and pasting them), but those who do | (as opposed to cutting and pasting them), but those who do will need | |||

will need to be alert to the potential visual ambiguity between the | to be alert to the potential visual ambiguity between the following | |||

following sets of characters: | sets of characters: | |||

G 6 | G 6 | |||

I l 1 | I l 1 | |||

O 0 | O 0 | |||

S 5 | S 5 | |||

U V | U V | |||

Z 2 | Z 2 | |||

Such ambiguities are usually resolved by context, but in a Punycode | Such ambiguities are usually resolved by context, but in a Punycode | |||

encoded string there is no context apparent to humans. | encoded string there is no context apparent to humans. | |||

6. Bootstring algorithms | 6. Bootstring algorithms | |||

Some parts of the pseudocode can be omitted if the parameters | Some parts of the pseudocode can be omitted if the parameters satisfy | |||

satisfy certain conditions (for which Punycode qualifies). These | certain conditions (for which Punycode qualifies). These parts are | |||

parts are enclosed in {braces}, and notes immediately following the | enclosed in {braces}, and notes immediately following the pseudocode | |||

pseudocode explain the conditions under which they can be omitted. | explain the conditions under which they can be omitted. | |||

Formally, code points are integers, and hence the pseudocode assumes | Formally, code points are integers, and hence the pseudocode assumes | |||

that arithmetic operations can be performed directly on code points. | that arithmetic operations can be performed directly on code points. | |||

In some programming languages, explicit conversion between code | In some programming languages, explicit conversion between code | |||

points and integers might be necessary. | points and integers might be necessary. | |||

6.1 Bias adaptation function | 6.1 Bias adaptation function | |||

function adapt(delta,numpoints,firsttime): | function adapt(delta,numpoints,firsttime): | |||

if firsttime then let delta = delta div damp | if firsttime then let delta = delta div damp | |||

else let delta = delta div 2 | else let delta = delta div 2 | |||

let delta = delta + (delta div numpoints) | let delta = delta + (delta div numpoints) | |||

let k = 0 | let k = 0 | |||

while delta > ((base - tmin) * tmax) div 2 do begin | while delta > ((base - tmin) * tmax) div 2 do begin | |||

let delta = delta div (base - tmin) | let delta = delta div (base - tmin) | |||

let k = k + base | let k = k + base | |||

end | end | |||

return k + (((base - tmin + 1) * delta) div (delta + skew)) | return k + (((base - tmin + 1) * delta) div (delta + skew)) | |||

It does not matter whether the modifications to delta and k | It does not matter whether the modifications to delta and k inside | |||

inside adapt() affect variables of the same name inside the | adapt() affect variables of the same name inside the | |||

encoding/decoding procedures, because after calling adapt() the | encoding/decoding procedures, because after calling adapt() the | |||

caller does not read those variables before overwriting them. | caller does not read those variables before overwriting them. | |||

6.2 Decoding procedure | 6.2 Decoding procedure | |||

let n = initial_n | let n = initial_n | |||

let i = 0 | let i = 0 | |||

let bias = initial_bias | let bias = initial_bias | |||

let output = an empty string indexed from 0 | let output = an empty string indexed from 0 | |||

consume all code points before the last delimiter (if there is one) | consume all code points before the last delimiter (if there is one) | |||

skipping to change at line 485 | skipping to change at page 11, line 36 | |||

end | end | |||

let bias = adapt(i - oldi, length(output) + 1, test oldi is 0?) | let bias = adapt(i - oldi, length(output) + 1, test oldi is 0?) | |||

let n = n + i div (length(output) + 1), fail on overflow | let n = n + i div (length(output) + 1), fail on overflow | |||

let i = i mod (length(output) + 1) | let i = i mod (length(output) + 1) | |||

{if n is a basic code point then fail} | {if n is a basic code point then fail} | |||

insert n into output at position i | insert n into output at position i | |||

increment i | increment i | |||

end | end | |||

The full statement enclosed in braces (checking whether n is a basic | The full statement enclosed in braces (checking whether n is a basic | |||

code point) can be omitted if initial_n exceeds all basic code | code point) can be omitted if initial_n exceeds all basic code points | |||

points (which is true for Punycode), because n is never less than | (which is true for Punycode), because n is never less than initial_n. | |||

initial_n. | ||||

In the assignment of t, where t is clamped to the range tmin through | In the assignment of t, where t is clamped to the range tmin through | |||

tmax, "+ tmin" can always be omitted. This makes the clamping | tmax, "+ tmin" can always be omitted. This makes the clamping | |||

calculation incorrect when bias < k < bias + tmin, but that cannot | calculation incorrect when bias < k < bias + tmin, but that cannot | |||

happen because of the way bias is computed and because of the | happen because of the way bias is computed and because of the | |||

constraints on the parameters. | constraints on the parameters. | |||

Because the decoder state can only advance monotonically, and there | Because the decoder state can only advance monotonically, and there | |||

is only one representation of any delta, there is therefore only | is only one representation of any delta, there is therefore only one | |||

one encoded string that can represent a given sequence of integers. | encoded string that can represent a given sequence of integers. The | |||

The only error conditions are invalid code points, unexpected | only error conditions are invalid code points, unexpected end-of- | |||

end-of-input, overflow, and basic code points encoded using deltas | input, overflow, and basic code points encoded using deltas instead | |||

instead of appearing literally. If the decoder fails on these | of appearing literally. If the decoder fails on these errors as | |||

errors as shown above, then it cannot produce the same output for | shown above, then it cannot produce the same output for two distinct | |||

two distinct inputs. Without this property it would have been | inputs. Without this property it would have been necessary to re- | |||

necessary to re-encode the output and verify that it matches the | encode the output and verify that it matches the input in order to | |||

input in order to guarantee the uniqueness of the encoding. | guarantee the uniqueness of the encoding. | |||

6.3 Encoding procedure | 6.3 Encoding procedure | |||

let n = initial_n | let n = initial_n | |||

let delta = 0 | let delta = 0 | |||

let bias = initial_bias | let bias = initial_bias | |||

let h = b = the number of basic code points in the input | let h = b = the number of basic code points in the input | |||

copy them to the output in order, followed by a delimiter if b > 0 | copy them to the output in order, followed by a delimiter if b > 0 | |||

{if the input contains a non-basic code point < n then fail} | {if the input contains a non-basic code point < n then fail} | |||

while h < length(input) do begin | while h < length(input) do begin | |||

skipping to change at line 543 | skipping to change at page 12, line 44 | |||

end | end | |||

end | end | |||

increment delta and n | increment delta and n | |||

end | end | |||

The full statement enclosed in braces (checking whether the input | The full statement enclosed in braces (checking whether the input | |||

contains a non-basic code point less than n) can be omitted if all | contains a non-basic code point less than n) can be omitted if all | |||

code points less than initial_n are basic code points (which is true | code points less than initial_n are basic code points (which is true | |||

for Punycode if code points are unsigned). | for Punycode if code points are unsigned). | |||

The brace-enclosed conditions "non-basic" and "or m is basic" can be | The brace-enclosed conditions "non-basic" and "or c is basic" can be | |||

omitted if initial_n exceeds all basic code points (which is true | omitted if initial_n exceeds all basic code points (which is true for | |||

for Punycode), because the code point being tested is never less | Punycode), because the code point being tested is never less than | |||

than initial_n. | initial_n. | |||

In the assignment of t, where t is clamped to the range tmin through | In the assignment of t, where t is clamped to the range tmin through | |||

tmax, "+ tmin" can always be omitted. This makes the clamping | tmax, "+ tmin" can always be omitted. This makes the clamping | |||

calculation incorrect when bias < k < bias + tmin, but that cannot | calculation incorrect when bias < k < bias + tmin, but that cannot | |||

happen because of the way bias is computed and because of the | happen because of the way bias is computed and because of the | |||

constraints on the parameters. | constraints on the parameters. | |||

The checks for overflow are necessary to avoid producing invalid | The checks for overflow are necessary to avoid producing invalid | |||

output when the input contains very large values or is very long. | output when the input contains very large values or is very long. | |||

The increment of delta at the bottom of the outer loop cannot | The increment of delta at the bottom of the outer loop cannot | |||

overflow because delta < length(input) before the increment, and | overflow because delta < length(input) before the increment, and | |||

length(input) is already assumed to be representable. The increment | length(input) is already assumed to be representable. The increment | |||

of n could overflow, but only if h == length(input), in which case | of n could overflow, but only if h == length(input), in which case | |||

the procedure is finished anyway. | the procedure is finished anyway. | |||

6.4 Overflow handling | 6.4 Overflow handling | |||

For IDNA, 26-bit unsigned integers are sufficient to handle all | For IDNA, 26-bit unsigned integers are sufficient to handle all valid | |||

valid IDNA labels without overflow, because any string that | IDNA labels without overflow, because any string that needed a 27-bit | |||

needed a 27-bit delta would have to exceed either the code point | delta would have to exceed either the code point limit (0..10FFFF) or | |||

limit (0..10FFFF) or the label length limit (63 characters). | the label length limit (63 characters). However, overflow handling | |||

However, overflow handling is necessary because the inputs are not | is necessary because the inputs are not necessarily valid IDNA | |||

necessarily valid IDNA labels. | labels. | |||

If the programming language does not provide overflow detection, | If the programming language does not provide overflow detection, the | |||

the following technique can be used. Suppose A, B, and C are | following technique can be used. Suppose A, B, and C are | |||

representable nonnegative integers and C is nonzero. Then A + B | representable nonnegative integers and C is nonzero. Then A + B | |||

overflows if and only if B > maxint - A, and A + (B * C) overflows | overflows if and only if B > maxint - A, and A + (B * C) overflows if | |||

if and only if B > (maxint - A) div C, where maxint is the greatest | and only if B > (maxint - A) div C, where maxint is the greatest | |||

integer for which maxint + 1 cannot be represented. Refer to | integer for which maxint + 1 cannot be represented. Refer to | |||

appendix D "Punycode sample implementation" for demonstrations of | appendix C "Punycode sample implementation" for demonstrations of | |||

this technique in the C language. | this technique in the C language. | |||

The decoding and encoding algorithms shown in sections 6.2 and | The decoding and encoding algorithms shown in sections 6.2 and 6.3 | |||

6.3 handle overflow by detecting it whenever it happens. Another | handle overflow by detecting it whenever it happens. Another | |||

approach is to enforce limits on the inputs that prevent overflow | approach is to enforce limits on the inputs that prevent overflow | |||

from happening. For example, if the encoder were to verify that | from happening. For example, if the encoder were to verify that no | |||

no input code points exceed M and that the input length does not | input code points exceed M and that the input length does not exceed | |||

exceed L, then no delta could ever exceed (M - initial_n) * (L + 1), | L, then no delta could ever exceed (M - initial_n) * (L + 1), and | |||

and hence no overflow could occur if integer variables were capable | hence no overflow could occur if integer variables were capable of | |||

of representing values that large. This prevention approach would | representing values that large. This prevention approach would | |||

impose more restrictions on the input than the detection approach | impose more restrictions on the input than the detection approach | |||

does, but might be considered simpler in some programming languages. | does, but might be considered simpler in some programming languages. | |||

In theory, the decoder could use an analogous approach, limiting the | In theory, the decoder could use an analogous approach, limiting the | |||

number of digits in a variable-length integer (that is, limiting the | number of digits in a variable-length integer (that is, limiting the | |||

number of iterations in the innermost loop). However, the number | number of iterations in the innermost loop). However, the number of | |||

of digits that suffice to represent a given delta can sometimes | digits that suffice to represent a given delta can sometimes | |||

represent much larger deltas (because of the adaptation), and hence | represent much larger deltas (because of the adaptation), and hence | |||

this approach would probably need integers wider than 32 bits. | this approach would probably need integers wider than 32 bits. | |||

Yet another approach for the decoder is to allow overflow to occur, | Yet another approach for the decoder is to allow overflow to occur, | |||

but to check the final output string by re-encoding it and comparing | but to check the final output string by re-encoding it and comparing | |||

to the decoder input. If and only if they do not match (using a | to the decoder input. If and only if they do not match (using a | |||

case-insensitive ASCII comparison) overflow has occurred. This | case-insensitive ASCII comparison) overflow has occurred. This | |||

delayed-detection approach would not impose any more restrictions on | delayed-detection approach would not impose any more restrictions on | |||

the input than the immediate-detection approach does, and might be | the input than the immediate-detection approach does, and might be | |||

considered simpler in some programming languages. | considered simpler in some programming languages. | |||

In fact, if the decoder is used only inside the IDNA ToUnicode | In fact, if the decoder is used only inside the IDNA ToUnicode | |||

operation [IDNA], then it need not check for overflow at all, | operation [IDNA], then it need not check for overflow at all, because | |||

because ToUnicode performs a higher level re-encoding and | ToUnicode performs a higher level re-encoding and comparison, and a | |||

comparison, and a mismatch has the same consequence as if the | mismatch has the same consequence as if the Punycode decoder had | |||

Punycode decoder had failed. | failed. | |||

7. Punycode examples | 7. Punycode examples | |||

7.1 Sample strings | 7.1 Sample strings | |||

In the Punycode encodings below, the ACE prefix is not shown. | In the Punycode encodings below, the ACE prefix is not shown. | |||

Backslashes show where line breaks have been inserted in strings too | Backslashes show where line breaks have been inserted in strings too | |||

long for one line. | long for one line. | |||

The first several examples are all translations of the sentence "Why | The first several examples are all translations of the sentence "Why | |||

skipping to change at line 692 | skipping to change at page 16, line 4 | |||

Punycode: PorqunopuedensimplementehablarenEspaol-fmd56a | Punycode: PorqunopuedensimplementehablarenEspaol-fmd56a | |||

(K) Vietnamese: | (K) Vietnamese: | |||

T<adotbelow>isaoh<odotbelow>kh<ocirc>ngth<ecirchookabove>ch\ | T<adotbelow>isaoh<odotbelow>kh<ocirc>ngth<ecirchookabove>ch\ | |||

<ihookabove>n<oacute>iti<ecircacute>ngVi<ecircdotbelow>t | <ihookabove>n<oacute>iti<ecircacute>ngVi<ecircdotbelow>t | |||

U+0054 u+1EA1 u+0069 u+0073 u+0061 u+006F u+0068 u+1ECD u+006B | U+0054 u+1EA1 u+0069 u+0073 u+0061 u+006F u+0068 u+1ECD u+006B | |||

u+0068 u+00F4 u+006E u+0067 u+0074 u+0068 u+1EC3 u+0063 u+0068 | u+0068 u+00F4 u+006E u+0067 u+0074 u+0068 u+1EC3 u+0063 u+0068 | |||

u+1EC9 u+006E u+00F3 u+0069 u+0074 u+0069 u+1EBF u+006E u+0067 | u+1EC9 u+006E u+00F3 u+0069 u+0074 u+0069 u+1EBF u+006E u+0067 | |||

U+0056 u+0069 u+1EC7 u+0074 | U+0056 u+0069 u+1EC7 u+0074 | |||

Punycode: TisaohkhngthchnitingVit-kjcr8268qyxafd2f1b9g | Punycode: TisaohkhngthchnitingVit-kjcr8268qyxafd2f1b9g | |||

The next several examples are all names of Japanese music artists, | The next several examples are all names of Japanese music artists, | |||

song titles, and TV programs, just because the author happens to | song titles, and TV programs, just because the author happens to have | |||

have them handy (but Japanese is useful for providing examples | them handy (but Japanese is useful for providing examples of single- | |||

of single-row text, two-row text, ideographic text, and various | row text, two-row text, ideographic text, and various mixtures | |||

mixtures thereof). | thereof). | |||

(L) 3<nen>B<gumi><kinpachi><sensei> | (L) 3<nen>B<gumi><kinpachi><sensei> | |||

u+0033 u+5E74 U+0042 u+7D44 u+91D1 u+516B u+5148 u+751F | u+0033 u+5E74 U+0042 u+7D44 u+91D1 u+516B u+5148 u+751F | |||

Punycode: 3B-ww4c5e180e575a65lsy2b | Punycode: 3B-ww4c5e180e575a65lsy2b | |||

(M) <amuro><namie>-with-SUPER-MONKEYS | (M) <amuro><namie>-with-SUPER-MONKEYS | |||

u+5B89 u+5BA4 u+5948 u+7F8E u+6075 u+002D u+0077 u+0069 u+0074 | u+5B89 u+5BA4 u+5948 u+7F8E u+6075 u+002D u+0077 u+0069 u+0074 | |||

u+0068 u+002D U+0053 U+0055 U+0050 U+0045 U+0052 u+002D U+004D | u+0068 u+002D U+0053 U+0055 U+0050 U+0045 U+0052 u+002D U+004D | |||

U+004F U+004E U+004B U+0045 U+0059 U+0053 | U+004F U+004E U+004B U+0045 U+0059 U+0053 | |||

Punycode: -with-SUPER-MONKEYS-pc58ag80a8qai00g7n9n | Punycode: -with-SUPER-MONKEYS-pc58ag80a8qai00g7n9n | |||

skipping to change at line 742 | skipping to change at page 17, line 7 | |||

for host name labels. (It is not a realistic example for IDNA, | for host name labels. (It is not a realistic example for IDNA, | |||

because IDNA never encodes pure ASCII labels.) | because IDNA never encodes pure ASCII labels.) | |||

(S) -> $1.00 <- | (S) -> $1.00 <- | |||

u+002D u+003E u+0020 u+0024 u+0031 u+002E u+0030 u+0030 u+0020 | u+002D u+003E u+0020 u+0024 u+0031 u+002E u+0030 u+0030 u+0020 | |||

u+003C u+002D | u+003C u+002D | |||

Punycode: -> $1.00 <-- | Punycode: -> $1.00 <-- | |||

7.2 Decoding traces | 7.2 Decoding traces | |||

In the following traces, the evolving state of the decoder is | In the following traces, the evolving state of the decoder is shown | |||

shown as a sequence of hexadecimal values, representing the code | as a sequence of hexadecimal values, representing the code points in | |||

points in the extended string. An asterisk appears just after the | the extended string. An asterisk appears just after the most | |||

most recently inserted code point, indicating both n (the value | recently inserted code point, indicating both n (the value preceeding | |||

preceeding the asterisk) and i (the position of the value just after | the asterisk) and i (the position of the value just after the | |||

the asterisk). Other numerical values are decimal. | asterisk). Other numerical values are decimal. | |||

Decoding trace of example B from section 7.1: | Decoding trace of example B from section 7.1: | |||

n is 128, i is 0, bias is 72 | n is 128, i is 0, bias is 72 | |||

input is "ihqwcrb4cv8a8dqg056pqjye" | input is "ihqwcrb4cv8a8dqg056pqjye" | |||

there is no delimiter, so extended string starts empty | there is no delimiter, so extended string starts empty | |||

delta "ihq" decodes to 19853 | delta "ihq" decodes to 19853 | |||

bias becomes 21 | bias becomes 21 | |||

4E0D * | 4E0D * | |||

delta "wc" decodes to 64 | delta "wc" decodes to 64 | |||

skipping to change at line 871 | skipping to change at page 20, line 30 | |||

needed delta is 34821, encodes as "575a" | needed delta is 34821, encodes as "575a" | |||

bias becomes 82 | bias becomes 82 | |||

next code point to insert is 7D44 | next code point to insert is 7D44 | |||

needed delta is 14592, encodes as "65l" | needed delta is 14592, encodes as "65l" | |||

bias becomes 67 | bias becomes 67 | |||

next code point to insert is 91D1 | next code point to insert is 91D1 | |||

needed delta is 42088, encodes as "sy2b" | needed delta is 42088, encodes as "sy2b" | |||

bias becomes 84 | bias becomes 84 | |||

output is "3B-ww4c5e180e575a65lsy2b" | output is "3B-ww4c5e180e575a65lsy2b" | |||

8. Security considerations | 8. Security Considerations | |||

Users expect each domain name in DNS to be controlled by a single | Users expect each domain name in DNS to be controlled by a single | |||

authority. If a Unicode string intended for use as a domain label | authority. If a Unicode string intended for use as a domain label | |||

could map to multiple ACE labels, then an internationalized domain | could map to multiple ACE labels, then an internationalized domain | |||

name could map to multiple ASCII domain names, each controlled by | name could map to multiple ASCII domain names, each controlled by a | |||

a different authority, some of which could be spoofs that hijack | different authority, some of which could be spoofs that hijack | |||

service requests intended for another. Therefore Punycode is | service requests intended for another. Therefore Punycode is | |||

designed so that each Unicode string has a unique encoding. | designed so that each Unicode string has a unique encoding. | |||

However, there can still be multiple Unicode representations of the | However, there can still be multiple Unicode representations of the | |||

"same" text, for various definitions of "same". This problem is | "same" text, for various definitions of "same". This problem is | |||

addressed to some extent by the Unicode standard under the topic of | addressed to some extent by the Unicode standard under the topic of | |||

canonicalization, and this work is leveraged for domain names by | canonicalization, and this work is leveraged for domain names by | |||

Nameprep [NAMEPREP]. | Nameprep [NAMEPREP]. | |||

9. References (non-normative) | 9. References | |||

[ASCII] Vint Cerf, "ASCII format for Network Interchange", | 9.1 Normative References | |||

1969-Oct-16, RFC 20. | ||||

[IDNA] Patrik Faltstrom, Paul Hoffman, Adam M. Costello, | [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate | |||

"Internationalizing Domain Names In Applications (IDNA)", | Requirement Levels", BCP 14, RFC 2119, March 1997. | |||

draft-ietf-idn-idna. | ||||

[NAMEPREP] Paul Hoffman, Marc Blanchet, "Nameprep: A | 9.2 Informative References | |||

Stringprep Profile for Internationalized Domain Names", | ||||

draft-ietf-idn-nameprep. | ||||

[PROVINCIAL] Michael Kaplan, "The 'anyone can be provincial!' page", | [RFC952] Harrenstien, K., Stahl, M. and E. Feinler, "DOD Internet | |||

http://www.trigeminal.com/samples/provincial.html. | Host Table Specification", RFC 952, October 1985. | |||

[RFC952] K. Harrenstien, M. Stahl, E. Feinler, "DOD Internet Host | [RFC1034] Mockapetris, P., "Domain Names - Concepts and | |||

Table Specification", 1985-Oct, RFC 952. | Facilities", STD 13, RFC 1034, November 1987. | |||

[RFC1034] P. Mockapetris, "Domain Names - Concepts and Facilities", | [IDNA] Faltstrom, P., Hoffman, P. and A. Costello, | |||

1987-Nov, RFC 1034. | "Internationalizing Domain Names in Applications | |||

(IDNA)", RFC 3490, March 2003. | ||||

[UNICODE] The Unicode Consortium, "The Unicode Standard", | [NAMEPREP] Hoffman, P. and M. Blanchet, "Nameprep: A Stringprep | |||

http://www.unicode.org/unicode/standard/standard.html. | Profile for Internationalized Domain Names (IDN)", RFC | |||

3491, March 2003. | ||||

A. Author contact information | [ASCII] Cerf, V., "ASCII format for Network Interchange", RFC | |||

20, October 1969. | ||||

Adam M. Costello | [PROVINCIAL] Kaplan, M., "The 'anyone can be provincial!' page", | |||

University of California, Berkeley | http://www.trigeminal.com/samples/provincial.html. | |||

http://www.nicemice.net/amc/ | ||||

B. Mixed-case annotation | [UNICODE] The Unicode Consortium, "The Unicode Standard", | |||

http://www.unicode.org/unicode/standard/standard.html. | ||||

A. Mixed-case annotation | ||||

In order to use Punycode to represent case-insensitive strings, | In order to use Punycode to represent case-insensitive strings, | |||

higher layers need to case-fold the strings prior to Punycode | higher layers need to case-fold the strings prior to Punycode | |||

encoding. The encoded string can use mixed case as an annotation | encoding. The encoded string can use mixed case as an annotation | |||

telling how to convert the folded string into a mixed-case string | telling how to convert the folded string into a mixed-case string for | |||

for display purposes. Note, however, that mixed-case annotation | display purposes. Note, however, that mixed-case annotation is not | |||

is not used by the ToASCII and ToUnicode operations specified in | used by the ToASCII and ToUnicode operations specified in [IDNA], and | |||

[IDNA], and therefore implementors of IDNA can disregard this | therefore implementors of IDNA can disregard this appendix. | |||

appendix. | ||||

Basic code points can use mixed case directly, because the decoder | Basic code points can use mixed case directly, because the decoder | |||

copies them verbatim, leaving lowercase code points lowercase, and | copies them verbatim, leaving lowercase code points lowercase, and | |||

leaving uppercase code points uppercase. Each non-basic code point | leaving uppercase code points uppercase. Each non-basic code point | |||

is represented by a delta, which is represented by a sequence of | is represented by a delta, which is represented by a sequence of | |||

basic code points, the last of which provides the annotation. If it | basic code points, the last of which provides the annotation. If it | |||

is uppercase, it is a suggestion to map the non-basic code point to | is uppercase, it is a suggestion to map the non-basic code point to | |||

uppercase (if possible); if it is lowercase, it is a suggestion to | uppercase (if possible); if it is lowercase, it is a suggestion to | |||

map the non-basic code point to lowercase (if possible). | map the non-basic code point to lowercase (if possible). | |||

These annotations do not alter the code points returned by decoders; | These annotations do not alter the code points returned by decoders; | |||

the annotations are returned separately, for the caller to use or | the annotations are returned separately, for the caller to use or | |||

ignore. Encoders can accept annotations in addition to code points, | ignore. Encoders can accept annotations in addition to code points, | |||

but the annotations do not alter the output, except to influence the | but the annotations do not alter the output, except to influence the | |||

uppercase/lowercase form of ASCII letters. | uppercase/lowercase form of ASCII letters. | |||

Punycode encoders and decoders need not support these annotations, | Punycode encoders and decoders need not support these annotations, | |||

and higher layers need not use them. | and higher layers need not use them. | |||

C. Disclaimer and license | B. Disclaimer and license | |||

Regarding this entire document or any portion of it (including | Regarding this entire document or any portion of it (including the | |||

the pseudocode and C code), the author makes no guarantees and | pseudocode and C code), the author makes no guarantees and is not | |||

is not responsible for any damage resulting from its use. The | responsible for any damage resulting from its use. The author grants | |||

author grants irrevocable permission to anyone to use, modify, | irrevocable permission to anyone to use, modify, and distribute it in | |||

and distribute it in any way that does not diminish the rights | any way that does not diminish the rights of anyone else to use, | |||

of anyone else to use, modify, and distribute it, provided that | modify, and distribute it, provided that redistributed derivative | |||

redistributed derivative works do not contain misleading author or | works do not contain misleading author or version information. | |||

version information. Derivative works need not be licensed under | Derivative works need not be licensed under similar terms. | |||

similar terms. | ||||

D. Punycode sample implementation | C. Punycode sample implementation | |||

/* | /* | |||

punycode.c from draft-ietf-idn-punycode-02 | punycode.c from RFC 3492 | |||

http://www.nicemice.net/idn/ | http://www.nicemice.net/idn/ | |||

Adam M. Costello | Adam M. Costello | |||

http://www.nicemice.net/amc/ | http://www.nicemice.net/amc/ | |||

This is ANSI C code (C89) implementing | This is ANSI C code (C89) implementing Punycode (RFC 3492). | |||

Punycode (draft-ietf-idn-punycode-02). | ||||

*/ | */ | |||

/************************************************************/ | /************************************************************/ | |||

/* Public interface (would normally go in its own .h file): */ | /* Public interface (would normally go in its own .h file): */ | |||

#include <limits.h> | #include <limits.h> | |||

enum punycode_status { | enum punycode_status { | |||

punycode_success, | punycode_success, | |||

skipping to change at line 1471 | skipping to change at page 34, line 5 | |||

if (r < 0) fail(io_error); | if (r < 0) fail(io_error); | |||

} | } | |||

return EXIT_SUCCESS; | return EXIT_SUCCESS; | |||

} | } | |||

usage(argv); | usage(argv); | |||

return EXIT_SUCCESS; /* not reached, but quiets compiler warning */ | return EXIT_SUCCESS; /* not reached, but quiets compiler warning */ | |||

} | } | |||

INTERNET-DRAFT expires 2002-Nov-23 | Author's Address | |||

Adam M. Costello | ||||

University of California, Berkeley | ||||

http://www.nicemice.net/amc/ | ||||

Full Copyright Statement | ||||

Copyright (C) The Internet Society (2003). All Rights Reserved. | ||||

This document and translations of it may be copied and furnished to | ||||

others, and derivative works that comment on or otherwise explain it | ||||

or assist in its implementation may be prepared, copied, published | ||||

and distributed, in whole or in part, without restriction of any | ||||

kind, provided that the above copyright notice and this paragraph are | ||||

included on all such copies and derivative works. However, this | ||||

document itself may not be modified in any way, such as by removing | ||||

the copyright notice or references to the Internet Society or other | ||||

Internet organizations, except as needed for the purpose of | ||||

developing Internet standards in which case the procedures for | ||||

copyrights defined in the Internet Standards process must be | ||||

followed, or as required to translate it into languages other than | ||||

English. | ||||

The limited permissions granted above are perpetual and will not be | ||||

revoked by the Internet Society or its successors or assigns. | ||||

This document and the information contained herein is provided on an | ||||

"AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING | ||||

TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING | ||||

BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION | ||||

HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF | ||||

MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. | ||||

Acknowledgement | ||||

Funding for the RFC Editor function is currently provided by the | ||||

Internet Society. | ||||

End of changes. 73 change blocks. | ||||

288 lines changed or deleted | | 281 lines changed or added | ||

This html diff was produced by rfcdiff 1.33. The latest version is available from http://tools.ietf.org/tools/rfcdiff/ |