draft-ietf-idn-punycode-00.txt   draft-ietf-idn-punycode-01.txt 
INTERNET-DRAFT Adam M. Costello INTERNET-DRAFT Adam M. Costello
draft-ietf-idn-punycode-00.txt 2002-Jan-06 draft-ietf-idn-punycode-01.txt 2002-Feb-24
Expires 2002-Jul-06 Expires 2002-Aug-24
Punycode version 0.3.3 Punycode: An encoding of Unicode for use with IDNA
Status of this Memo Status of this Memo
This document is an Internet-Draft and is in full conformance with This document is an Internet-Draft and is in full conformance with
all provisions of Section 10 of RFC2026. all provisions of Section 10 of RFC2026.
Internet-Drafts are working documents of the Internet Engineering Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note Task Force (IETF), its areas, and its working groups. Note
that other groups may also distribute working documents as that other groups may also distribute working documents as
Internet-Drafts. Internet-Drafts.
skipping to change at line 28 skipping to change at line 28
months and may be updated, replaced, or obsoleted by other documents months and may be updated, replaced, or obsoleted by other documents
at any time. It is inappropriate to use Internet-Drafts as at any time. It is inappropriate to use Internet-Drafts as
reference material or to cite them other than as "work in progress." reference material or to cite them other than as "work in progress."
The list of current Internet-Drafts can be accessed at The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt http://www.ietf.org/ietf/1id-abstracts.txt
The list of Internet-Draft Shadow Directories can be accessed at The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html http://www.ietf.org/shadow.html
Distribution of this document is unlimited. Please send comments Distribution of this document is unlimited.
to the author at amc@cs.berkeley.edu, or to the idn working
group at idn@ops.ietf.org. A non-paginated (and possibly
newer) version of this specification may be available at
http://www.cs.berkeley.edu/~amc/idn/
Abstract Abstract
Punycode is a simple and efficient encoding designed for use with Punycode is a simple and efficient transfer encoding syntax designed
Internationalized Domain Names [IDN] [IDNA]. It uniquely and for use with Internationalized Domain Names in Applications [IDNA].
reversibly transforms a Unicode string [UNICODE] into an ASCII It uniquely and reversibly transforms a Unicode string [UNICODE]
string. ASCII characters in the Unicode string are represented into an ASCII string. ASCII characters in the Unicode string are
literally, and non-ASCII characters are represented by ASCII represented literally, and non-ASCII characters are represented by
characters that are allowed in hostname labels (letters, digits, ASCII characters that are allowed in host name labels (letters,
and hyphens). Bootstring is a general algorithm that allows a digits, and hyphens). Bootstring is a general algorithm that allows
string of basic code points to uniquely represent any string of code a string of basic code points to uniquely represent any string of
points drawn from a larger set. Punycode is an instance Bootstring code points drawn from a larger set. Punycode is an instance of
that uses particular parameter values appropriate for IDNA. This Bootstring that uses particular parameter values appropriate for
document specifies Bootstring and the parameter values for Punycode. IDNA. This document specifies Bootstring and the parameter values
for Punycode.
Contents Contents
1. Introduction 1. Introduction
1.1 Features
1.2 Interaction of protocol parts
2. Terminology 2. Terminology
3. Bootstring description 3. Bootstring description
3.1 Basic code point segregation 3.1 Basic code point segregation
3.2 Insertion unsort coding 3.2 Insertion unsort coding
3.3 Generalized variable-length integers 3.3 Generalized variable-length integers
3.4 Bias adaptation 3.4 Bias adaptation
4. Bootstring parameters 4. Bootstring parameters
5. Parameter values for Punycode 5. Parameter values for Punycode
6. Bootstring algorithms 6. Bootstring algorithms
6.1 Bias adaptation function 6.1 Bias adaptation function
6.2 Decoding procedure 6.2 Decoding procedure
6.3 Encoding procedure 6.3 Encoding procedure
6.4 Alternative methods for handling overflow 6.4 Alternative methods for handling overflow
7. Punycode example strings 7. Punycode examples
7.1 Sample strings
7.2 Decoding traces
7.3 Encoding traces
8. Security considerations 8. Security considerations
9. References 9. References
A. Author contact information A. Author contact information
B. Mixed-case annotation B. Mixed-case annotation
C. Disclaimer and License C. Disclaimer and license
D. Punycode sample implementation D. Punycode sample implementation
1. Introduction 1. Introduction
The IDNA draft [IDNA] describes an architecture for supporting [IDNA] describes an architecture for supporting internationalized
internationalized domain names. Labels containing non-ASCII domain names. Labels containing non-ASCII characters can be
characters can be represented by ACE labels, which begin with a represented by ACE labels, which begin with a special prefix and
special prefix and contain only ASCII characters. The remainder contain only ASCII characters. The remainder of the label after the
of the label after the prefix is a Punycode encoding of a Unicode prefix is a Punycode encoding of a Unicode string satisfying certain
string satisfying certain constraints. For the details of the constraints. For the details of the prefix and constraints, see
prefix and constraints, see [IDNA] and [NAMEPREP]. [IDNA] and [NAMEPREP].
1.1 Features
Bootstring has been designed to have the following features: Bootstring has been designed to have the following features:
* Completeness: Every extended string (sequence of arbitrary code * Completeness: Every extended string (sequence of arbitrary code
points) can be represented by a basic string (sequence of basic points) can be represented by a basic string (sequence of basic
code points). Restrictions on what strings are allowed, and on code points). Restrictions on what strings are allowed, and on
length, may be imposed by higher layers. length, can be imposed by higher layers.
* Uniqueness: There is at most one basic string that represents a * Uniqueness: There is at most one basic string that represents a
given extended string. given extended string.
* Reversibility: Any extended string mapped to a basic string can * Reversibility: Any extended string mapped to a basic string can
be recovered from that basic string. be recovered from that basic string.
* Efficient encoding: The ratio of extended string length to * Efficient encoding: The ratio of basic string length to
basic string length is small. This is important in the context extended string length is small. This is important in the
of domain names because RFC 1034 [RFC1034] restricts the length context of domain names because RFC 1034 [RFC1034] restricts the
of a domain label to 63 characters. length of a domain label to 63 characters.
* Simplicity: The encoding and decoding algorithms are reasonably * Simplicity: The encoding and decoding algorithms are reasonably
simple to implement. The goals of efficiency and simplicity are simple to implement. The goals of efficiency and simplicity are
at odds; Bootstring aims at a good balance between them. at odds; Bootstring aims at a good balance between them.
* Readability: Basic code points appearing in the extended * Readability: Basic code points appearing in the extended string
string are represented as themselves in the basic string. This are represented as themselves in the basic string (although the
comes for free because it makes the encoding more efficient on main purpose is to improve efficiency, not readability).
average.
In addition, Punycode can support an optional feature described in Punycode can also support an additional feature described in
appendix B "Mixed-case annotation". appendix B "Mixed-case annotation".
1.2 Interaction of protocol parts
Punycode is used by the IDNA protocol [IDNA] for converting domain
labels into ASCII; it is not designed for any other purpose. It is
explicitly not designed for processing arbitrary free text.
2. Terminology 2. Terminology
The key words "must", "shall", "required", "should", "recommended", The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
and "may" in this document are to be interpreted as described in RFC "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
2119 [RFC2119]. document are to be interpreted as described in RFC 2119 [RFC2119].
A code point is an integral value associated with a character in a A code point is an integral value associated with a character in a
coded character set. coded character set.
As in the Unicode Standard [UNICODE], Unicode code points are As in the Unicode Standard [UNICODE], Unicode code points are
denoted by "U+" followed by four to six hexadecimal digits, while a denoted by "U+" followed by four to six hexadecimal digits, while a
range of code points is denoted by two hexadecimal numbers separated range of code points is denoted by two hexadecimal numbers separated
by "..", with no prefixes. by "..", with no prefixes.
The operators div and mod perform integer division; (x div y) is the The operators div and mod perform integer division; (x div y) is the
skipping to change at line 140 skipping to change at line 149
and remainder are always nonnegative. and remainder are always nonnegative.
The break statement jumps out of the innermost loop (as in C). The break statement jumps out of the innermost loop (as in C).
An overflow is an attempt to compute a value that exceeds the An overflow is an attempt to compute a value that exceeds the
maximum value of an integer variable. maximum value of an integer variable.
3. Bootstring description 3. Bootstring description
Bootstring represents an arbitrary sequence of code points (the Bootstring represents an arbitrary sequence of code points (the
"extended string") as a sequence of basic code points (the "extended string") as a sequence of basic code points (the "basic
"basic string"). This section describes the representation. string"). This section describes the representation. Section 6
Section 6 "Bootstring algorithms" presents the algorithms as "Bootstring algorithms" presents the algorithms as pseudocode.
pseudocode. Sections 7.1 "Decoding traces" and 7.2 "Encoding traces" trace the
algorithms for sample inputs.
The following sections describe the four techniques used in
Bootstring. "Basic code point segregation" is a very simple
and efficient encoding for basic code points occurring in the
extended string: they are simply copied all at once. "Insertion
unsort coding" encodes the non-basic code points as deltas, and
processes the code points in numerical order rather than in order of
appearance, which typically results in smaller deltas. The deltas
are represented as "generalized variable-length integers", which use
basic code points to represent nonnegative integers. The parameters
of this integer representation are dynamically adjusted using "bias
adaptation", to improve efficiency when consecutive deltas have
similar magnitudes.
3.1 Basic code point segregation 3.1 Basic code point segregation
All basic code points appearing in the extended string are All basic code points appearing in the extended string are
represented literally at the beginning of the basic string, in their represented literally at the beginning of the basic string, in their
original order, followed by a delimiter if (and only if) the number original order, followed by a delimiter if (and only if) the number
of basic code points is nonzero. The delimiter is a particular of basic code points is nonzero. The delimiter is a particular
basic code point, which never appears in the remainder of the basic basic code point, which never appears in the remainder of the basic
string. The decoder can therefore find the end of the literal string. The decoder can therefore find the end of the literal
portion (if there is one) by scanning for the last delimiter. portion (if there is one) by scanning for the last delimiter.
3.2 Insertion unsort coding 3.2 Insertion unsort coding
The remainder of the basic string (after the last delimiter if there The remainder of the basic string (after the last delimiter if there
is one) represents a sequence of nonnegative integral deltas as is one) represents a sequence of nonnegative integral deltas as
generalized variable-length integers, described in section 3.3. The generalized variable-length integers, described in section 3.3. The
meaning of the deltas is best understood in terms of the decoder. meaning of the deltas is best understood in terms of the decoder.
The decoder builds the extended string incrementally. Initially, The decoder builds the extended string incrementally. Initially,
the extended string is a copy of the literal portion of the basic the extended string is a copy of the literal portion of the basic
string (excluding the last delimiter). Each delta causes the string (excluding the last delimiter). The decoder inserts
decoder to insert a code point into the extended string according non-basic code points, one for each delta, into the extended string,
to the following procedure. There are two state variables: a ultimately arriving at the final decoded string.
code point n, and an index i that ranges from zero (which is the
first position of the extended string) to the current length of At the heart of this process is a state machine with two state
the extended string (which refers to a potential position beyond variables: an index i and a counter n. The index i refers to
the current end). The decoder advances the state monotonically a position in the extended string; it ranges from 0 (the first
(never returning to an earlier state) by taking steps only upward. position) to the current length of the extended string (which refers
Each step increments i, except when i already equals the length to a potential position beyond the current end). If the current
of the extended string, in which case a step resets i to zero state is <n,i>, the next state is <n,i+1> if i is less than the
and increments n. For each delta (in order), the decoder takes length of the extended string, or <n+1,0> if i equals the length of
delta steps upward, then inserts the value n into the extended the extended string. In other words, each state change causes i to
string at position i, then increments i (to skip over the code increment, wrapping around to zero if necessary, and n counts the
point just inserted). (An implementation should not take each number of wrap-arounds.
step individually, but should insead use division and remainder
calculations to advance by delta steps all at once.) It is an error Notice that the state always advances monotonically (there is no
if the inserted code point is a basic code point (because basic code way for the decoder to return to an earlier state). At each state,
points must be segregated as described in section 3.1). an insertion is either performed or not performed. At most one
insertion is performed in a given state. An insertion inserts the
value of n at position i in the extended string. The deltas are
a run-length encoding of this sequence of events: they are the
lengths of the runs of non-insertion states preceeding the insertion
states. Hence, for each delta, the decoder performs delta state
changes, then an insertion, and then one more state change. (An
implementation need not perform each state change individually, but
can instead use division and remainder calculations to compute the
next insertion state directly.) It is an error if the inserted code
point is a basic code point (because basic code points were supposed
to be segregated as described in section 3.1).
The encoder's main task is to derive the sequence of deltas that The encoder's main task is to derive the sequence of deltas that
will cause the decoder to construct the desired string. It can do will cause the decoder to construct the desired string. It can do
this by repeatedly scanning the extended string for the next code this by repeatedly scanning the extended string for the next code
point that the decoder would need to insert, and counting the number point that the decoder would need to insert, and counting the number
of steps the decoder would need to take, mindful of the fact that of state changes the decoder would need to perform, mindful of the
the decoder will be stepping over only those code points that have fact that the decoder's extended string will include only those
already been inserted. Section 6.3 "Encoding procedure" gives a code points that have already been inserted. Section 6.3 "Encoding
precise algorithm. procedure" gives a precise algorithm.
3.3 Generalized variable-length integers 3.3 Generalized variable-length integers
In a conventional integer representation the base is the number of In a conventional integer representation the base is the number of
distinct symbols for digits, whose values are 0 through base-1. Let distinct symbols for digits, whose values are 0 through base-1. Let
digit_0 denote the least significant digit, digit_1 the next least digit_0 denote the least significant digit, digit_1 the next least
significant, and so on. The value represented is the sum over j of significant, and so on. The value represented is the sum over j of
digit_j * w(j), where w(j) = base^j is the weight (scale factor) digit_j * w(j), where w(j) = base^j is the weight (scale factor)
for position j. For example, in the base 8 integer 437, the digits for position j. For example, in the base 8 integer 437, the digits
are 7, 3, and 4, and the weights are 1, 8, and 64, so the value is are 7, 3, and 4, and the weights are 1, 8, and 64, so the value is
7 + 3*8 + 4*64 = 287. This representation has two disadvantages: 7 + 3*8 + 4*64 = 287. This representation has two disadvantages:
First, there are multiple encodings of each value (because there First, there are multiple encodings of each value (because there
can be extra zeros in the most significant positions), which can be extra zeros in the most significant positions), which is
is inconvenient when unique encodings are required. Second, inconvenient when unique encodings are needed. Second, the integer
the integer is not self-delimiting, so if multiple integers are is not self-delimiting, so if multiple integers are concatenated the
concatenated the boundaries between them are lost. boundaries between them are lost.
The generalized variable-length representation solves these two The generalized variable-length representation solves these two
problems. The digit values are still 0 through base-1, but now problems. The digit values are still 0 through base-1, but now
the integer is self-delimiting by means of thresholds t(j), each the integer is self-delimiting by means of thresholds t(j), each
of which is in the range 0 through base-1. Exactly one digit, the of which is in the range 0 through base-1. Exactly one digit, the
most significant, satisfies digit_j < t(j). Therefore, if several most significant, satisfies digit_j < t(j). Therefore, if several
integers are concatenated, it is easy to separate them, starting integers are concatenated, it is easy to separate them, starting
with the first if they are little-endian (least significant digit with the first if they are little-endian (least significant digit
first), or starting with the last if they are big-endian (most first), or starting with the last if they are big-endian (most
significant digit first). As before, the value is the sum over j of significant digit first). As before, the value is the sum over j of
skipping to change at line 221 skipping to change at line 255
of which is in the range 0 through base-1. Exactly one digit, the of which is in the range 0 through base-1. Exactly one digit, the
most significant, satisfies digit_j < t(j). Therefore, if several most significant, satisfies digit_j < t(j). Therefore, if several
integers are concatenated, it is easy to separate them, starting integers are concatenated, it is easy to separate them, starting
with the first if they are little-endian (least significant digit with the first if they are little-endian (least significant digit
first), or starting with the last if they are big-endian (most first), or starting with the last if they are big-endian (most
significant digit first). As before, the value is the sum over j of significant digit first). As before, the value is the sum over j of
digit_j * w(j), but the weights are different: digit_j * w(j), but the weights are different:
w(0) = 1 w(0) = 1
w(j) = w(j-1) * (base - t(j-1)) for j > 0 w(j) = w(j-1) * (base - t(j-1)) for j > 0
For example, consider the little-endian sequence of base 8 digits For example, consider the little-endian sequence of base 8 digits
734251... Suppose the thresholds are 2, 3, 5, 5, 5, 5... This 734251... Suppose the thresholds are 2, 3, 5, 5, 5, 5... This
implies that the weights are 1, 1*(8-2) = 6, 6*(8-3) = 30, 30*(8-5) implies that the weights are 1, 1*(8-2) = 6, 6*(8-3) = 30, 30*(8-5)
= 90, 90*(8-5) = 270, and so on. 7 is not less than 2, and 3 is = 90, 90*(8-5) = 270, and so on. 7 is not less than 2, and 3 is
not less than 3, but 4 is less than 5, so 4 must be the last digit. not less than 3, but 4 is less than 5, so 4 is the last digit. The
The value of 734 is 7*1 + 3*6 + 4*30 = 145. The next integer is value of 734 is 7*1 + 3*6 + 4*30 = 145. The next integer is 251,
251, with value 2*1 + 5*6 + 1*30 = 62. Decoding this representation with value 2*1 + 5*6 + 1*30 = 62. Decoding this representation
is very similar to decoding a conventional integer: Start with a is very similar to decoding a conventional integer: Start with a
current value of N = 0 and a weight w = 1. Fetch the next digit d current value of N = 0 and a weight w = 1. Fetch the next digit d
and increase N by d * w. If d is less than the current threshold and increase N by d * w. If d is less than the current threshold
(t) then stop, otherwise increase w by a factor of (base - t), (t) then stop, otherwise increase w by a factor of (base - t),
update t for the next position, and repeat. update t for the next position, and repeat.
Encoding this representation is similar to encoding a conventional Encoding this representation is similar to encoding a conventional
integer: If N < t then output one digit for N and stop, otherwise integer: If N < t then output one digit for N and stop, otherwise
output the digit for t + ((N - t) mod (base - t)), then replace N output the digit for t + ((N - t) mod (base - t)), then replace N
with (N - t) div (base - t), update t for the next position, and with (N - t) div (base - t), update t for the next position, and
skipping to change at line 306 skipping to change at line 339
a hint about the likely size of the next delta, and so t(j) is a hint about the likely size of the next delta, and so t(j) is
set to tmax for the more significant digits starting with the one set to tmax for the more significant digits starting with the one
expected to be last, tmin for the less significant digits up through expected to be last, tmin for the less significant digits up through
the one expected to be third-last, and somewhere between tmin and the one expected to be third-last, and somewhere between tmin and
tmax for the digit expected to be second-last (balancing the hope of tmax for the digit expected to be second-last (balancing the hope of
the expected-last digit being unnecessary against the danger of it the expected-last digit being unnecessary against the danger of it
being insufficient). being insufficient).
4. Bootstring parameters 4. Bootstring parameters
Given a set of basic code points, one must be designated as Given a set of basic code points, one needs to be designated as
the delimiter. The base can be no greater than the number of the delimiter. The base cannot be greater than the number of
distinguishable basic code points remaining. The digit-values distinguishable basic code points remaining. The digit-values in
in the range 0 through base-1 must be associated with distinct the range 0 through base-1 need to be associated with distinct
non-delimiter basic code points. In some cases multiple code points non-delimiter basic code points. In some cases multiple code points
must have the same digit-value; for example, uppercase and lowercase need to have the same digit-value; for example, uppercase and
versions of the same letter must be equivalent if basic strings are lowercase versions of the same letter need to be equivalent if basic
case-insensitive. strings are case-insensitive.
The initial value of n must be no greater than the minimum non-basic The initial value of n cannot be greater than the minimum non-basic
code point that could appear in extended strings. code point that could appear in extended strings.
The remaining five parameters (tmin, tmax, skew, damp, and the The remaining five parameters (tmin, tmax, skew, damp, and the
initial value of bias) must satisfy the following constraints: initial value of bias) need to satisfy the following constraints:
0 <= tmin <= tmax <= base-1 0 <= tmin <= tmax <= base-1
skew >= 1 skew >= 1
damp >= 2 damp >= 2
initial_bias mod base <= base - tmin initial_bias mod base <= base - tmin
Provided the constraints are satisfied, these five parameters affect Provided the constraints are satisfied, these five parameters affect
efficiency but not correctness. They should be chosen empirically. efficiency but not correctness. They are best chosen empirically.
If support for mixed-case annotation is desired (see appendix B), If support for mixed-case annotation is desired (see appendix B),
make sure that the code points corresponding to 0 through tmax-1 all make sure that the code points corresponding to 0 through tmax-1 all
have both uppercase and lowercase forms. have both uppercase and lowercase forms.
5. Parameter values for Punycode 5. Parameter values for Punycode
Punycode uses the following Bootstring parameter values: Punycode uses the following Bootstring parameter values:
base = 36 base = 36
tmin = 1 tmin = 1
tmax = 26 tmax = 26
skew = 38 skew = 38
damp = 700 damp = 700
initial_bias = 72 initial_bias = 72
initial_n = 0x80 initial_n = 128 = 0x80
In Punycode, code points are Unicode code points [UNICODE], that Although the only restriction Punycode imposes on the input integers
is, integers in the range 0..10FFFF, but not D800..DFFF, which are is that they be nonnegative, these parameters are especially
reserved for use by UTF-16. The basic code points are the ASCII designed to work well with Unicode [UNICODE] code points, which
code points (0..7F), of which U+002D (-) is the delimiter, and some are integers in the range 0..10FFFF (but not D800..DFFF, which are
of the others have digit-values as follows: reserved for use by the UTF-16 encoding of Unicode). The basic code
points are the ASCII code points (0..7F), of which U+002D (-) is the
delimiter, and some of the others have digit-values as follows:
code points digit-values code points digit-values
------------ ---------------------- ------------ ----------------------
41..5A (A-Z) = 0 to 25, respectively 41..5A (A-Z) = 0 to 25, respectively
61..7A (a-z) = 0 to 25, respectively 61..7A (a-z) = 0 to 25, respectively
30..39 (0-9) = 26 to 35, respectively 30..39 (0-9) = 26 to 35, respectively
Using hyphen-minus as the delimiter implies that the encoded string Using hyphen-minus as the delimiter implies that the encoded string
can end with a hyphen-minus only if the Unicode string consists can end with a hyphen-minus only if the Unicode string consists
entirely of basic code points, but IDNA forbids such strings from entirely of basic code points, but IDNA forbids such strings from
being encoded. The encoded string can begin with a hyphen-minus, being encoded. The encoded string can begin with a hyphen-minus,
but IDNA prepends a prefix. Therefore IDNA using Punycode conforms but IDNA prepends a prefix. Therefore IDNA using Punycode conforms
to the RFC 952 recommendation that hostname labels neither begin nor to the RFC 952 rule that host name labels neither begin nor end with
end with a hyphen-minus [RFC952]. a hyphen-minus [RFC952].
A decoder must recognize the letters in both uppercase and lowercase A decoder MUST recognize the letters in both uppercase and lowercase
forms (including mixtures of both forms). An encoder should output forms (including mixtures of both forms). An encoder SHOULD output
only uppercase forms or only lowercase forms, unless it uses only uppercase forms or only lowercase forms, unless it uses
mixed-case annotation (see appendix B). mixed-case annotation (see appendix B).
Presumably most users will not manually write or type encoded Presumably most users will not manually write or type encoded
strings (as opposed to cutting and pasting them), but those who do strings (as opposed to cutting and pasting them), but those who do
will need to be alert to the potential visual ambiguity between the will need to be alert to the potential visual ambiguity between the
following sets of characters: following sets of characters:
G 6 G 6
I l 1 I l 1
skipping to change at line 390 skipping to change at line 425
Z 2 Z 2
Such ambiguities are usually resolved by context, but in a Punycode Such ambiguities are usually resolved by context, but in a Punycode
encoded string there is no context apparent to humans. encoded string there is no context apparent to humans.
6. Bootstring algorithms 6. Bootstring algorithms
Some parts of the pseudocode can be omitted if the parameters Some parts of the pseudocode can be omitted if the parameters
satisfy certain conditions (for which Punycode qualifies). These satisfy certain conditions (for which Punycode qualifies). These
parts are enclosed in {braces}, and notes immediately following the parts are enclosed in {braces}, and notes immediately following the
pseudocode explain the conditions under which they may be omitted. pseudocode explain the conditions under which they can be omitted.
Formally, code points are integers, and hence the pseudocode assumes Formally, code points are integers, and hence the pseudocode assumes
that arithmetic operations can be performed directly on code points. that arithmetic operations can be performed directly on code points.
Some actual programming languages might require explicit conversion In some programming languages, explicit conversion between code
between code points and integers. points and integers might be necessary.
6.1 Bias adaptation function 6.1 Bias adaptation function
function adapt(delta,numpoints,firsttime): function adapt(delta,numpoints,firsttime):
if firsttime then let delta = delta div damp if firsttime then let delta = delta div damp
else let delta = delta div 2 else let delta = delta div 2
let delta = delta + (delta div numpoints) let delta = delta + (delta div numpoints)
let k = 0 let k = 0
while delta > ((base - tmin) * tmax) div 2 do begin while delta > ((base - tmin) * tmax) div 2 do begin
let delta = delta div (base - tmin) let delta = delta div (base - tmin)
skipping to change at line 432 skipping to change at line 467
and copy them to output, fail on any non-basic code point and copy them to output, fail on any non-basic code point
if more than zero code points were consumed then consume one more if more than zero code points were consumed then consume one more
(which will be the last delimiter) (which will be the last delimiter)
while the input is not exhausted do begin while the input is not exhausted do begin
let oldi = i let oldi = i
let w = 1 let w = 1
for k = base to infinity in steps of base do begin for k = base to infinity in steps of base do begin
consume a code point, or fail if there was none to consume consume a code point, or fail if there was none to consume
let digit = the code point's digit-value, fail if it has none let digit = the code point's digit-value, fail if it has none
let i = i + digit * w, fail on overflow let i = i + digit * w, fail on overflow
let t = tmin if k <= bias, tmax if k >= bias + tmax, or let t = tmin if k <= bias {+ tmin}, or
k - bias otherwise tmax if k >= bias + tmax, or k - bias otherwise
if digit < t then break if digit < t then break
let w = w * (base - t), fail on overflow let w = w * (base - t), fail on overflow
end end
let bias = adapt(i - oldi, length(output) + 1, test oldi is 0?) let bias = adapt(i - oldi, length(output) + 1, test oldi is 0?)
let n = n + i div (length(output) + 1), fail on overflow let n = n + i div (length(output) + 1), fail on overflow
let i = i mod (length(output) + 1) let i = i mod (length(output) + 1)
{if n is a basic code point then fail} {if n is a basic code point then fail}
insert n into output at position i insert n into output at position i
increment i increment i
end end
The statement enclosed in braces (checking whether n is a basic The full statement enclosed in braces (checking whether n is a basic
code point) may be omitted if initial_n exceeds all basic code code point) can be omitted if initial_n exceeds all basic code
points (which is true for Punycode), because n is never less than points (which is true for Punycode), because n is never less than
initial_n. initial_n.
In the assignment of t, where t is clamped to the range tmin through
tmax, "+ tmin" can always be omitted. This makes the clamping
calculation incorrect when bias < k < bias + tmin, but that cannot
happen because of the way bias is computed and because of the
constraints on the parameters.
Because the decoder state can only advance monotonically, and there Because the decoder state can only advance monotonically, and there
is only one representation of any delta, there is therefore only is only one representation of any delta, there is therefore only
one encoded string that can represent a given sequence of integers. one encoded string that can represent a given sequence of integers.
The only error conditions are invalid code points, unexpected The only error conditions are invalid code points, unexpected
end-of-input, overflow, and basic code points encoded using deltas end-of-input, overflow, and basic code points encoded using deltas
instead of appearing literally. If the decoder fails on these instead of appearing literally. If the decoder fails on these
errors as shown above, then it cannot produce the same output for errors as shown above, then it cannot produce the same output for
two distinct inputs, and hence it need not re-encode its output to two distinct inputs, and hence it need not re-encode its output to
verify that it matches the input. verify that it matches the input.
The assignment of t, where t is clamped to the range tmin through
tmax, does not handle the case where bias < k < bias + tmin, but
that is impossible because of the way bias is computed and because
of the constraints on the parameters.
If the programming language does not provide overflow detection, If the programming language does not provide overflow detection,
the following technique can be used. Suppose A, B, and C are the following technique can be used. Suppose A, B, and C are
representable nonnegative integers and C is nonzero. Then A + B representable nonnegative integers and C is nonzero. Then A + B
overflows if and only if B > maxint - A, and A + (B * C) overflows overflows if and only if B > maxint - A, and A + (B * C) overflows
if and only if B > (maxint - A) div C, where maxint is the greatest if and only if B > (maxint - A) div C, where maxint is the greatest
integer for which maxint + 1 cannot be represented. Refer to integer for which maxint + 1 cannot be represented. Refer to
appendix D "Punycode sample implementation" for demonstrations of appendix D "Punycode sample implementation" for demonstrations of
this technique in the C language. See also section 6.4 "Alternative this technique in the C language. See also section 6.4 "Alternative
methods for handling overflow". methods for handling overflow".
skipping to change at line 492 skipping to change at line 528
{if the input contains a non-basic code point < n then fail} {if the input contains a non-basic code point < n then fail}
while h < length(input) do begin while h < length(input) do begin
let m = the minimum {non-basic} code point >= n in the input let m = the minimum {non-basic} code point >= n in the input
let delta = delta + (m - n) * (h + 1), fail on overflow let delta = delta + (m - n) * (h + 1), fail on overflow
let n = m let n = m
for each code point c in the input (in order) do begin for each code point c in the input (in order) do begin
if c < n {or c is basic} then increment delta, fail on overflow if c < n {or c is basic} then increment delta, fail on overflow
if c == n then begin if c == n then begin
let q = delta let q = delta
for k = base to infinity in steps of base do begin for k = base to infinity in steps of base do begin
let t = tmin if k <= bias, tmax if k >= bias + tmax, or let t = tmin if k <= bias {+ tmin}, or
k - bias otherwise tmax if k >= bias + tmax, or k - bias otherwise
if q < t then break if q < t then break
output the code point for digit t + ((q - t) mod (base - t)) output the code point for digit t + ((q - t) mod (base - t))
let q = (q - t) div (base - t) let q = (q - t) div (base - t)
end end
output the code point for digit q output the code point for digit q
let bias = adapt(delta, h + 1, test h equals b?) let bias = adapt(delta, h + 1, test h equals b?)
let delta = 0 let delta = 0
increment h increment h
end end
end end
skipping to change at line 516 skipping to change at line 553
The full statement enclosed in braces (checking whether the input The full statement enclosed in braces (checking whether the input
contains a non-basic code point less than n) can be omitted if all contains a non-basic code point less than n) can be omitted if all
code points less than initial_n are basic code points (which is true code points less than initial_n are basic code points (which is true
for Punycode if code points are unsigned). for Punycode if code points are unsigned).
The brace-enclosed conditions "non-basic" and "or m is basic" can be The brace-enclosed conditions "non-basic" and "or m is basic" can be
omitted if initial_n exceeds all basic code points (which is true omitted if initial_n exceeds all basic code points (which is true
for Punycode), because the code point being tested is never less for Punycode), because the code point being tested is never less
than initial_n. than initial_n.
In the assignment of t, where t is clamped to the range tmin through
tmax, "+ tmin" can always be omitted. This makes the clamping
calculation incorrect when bias < k < bias + tmin, but that cannot
happen because of the way bias is computed and because of the
constraints on the parameters.
The checks for overflow are necessary to avoid producing invalid The checks for overflow are necessary to avoid producing invalid
output when the input contains very large values or is very long. output when the input contains very large values or is very long.
Wider integer variables can handle more extreme inputs. For IDNA, Wider integer variables can handle more extreme inputs. For IDNA,
26-bit unsigned integers are sufficient, because any string that 26-bit unsigned integers are sufficient, because any string that
required a 27-bit delta would have to exceed either the code point needed a 27-bit delta would have to exceed either the code point
limit (0..10FFFF) or the label length limit (63 characters). limit (0..10FFFF) or the label length limit (63 characters).
The increment of delta at the bottom of the outer loop cannot The increment of delta at the bottom of the outer loop cannot
overflow because delta < length(input) before the increment, and overflow because delta < length(input) before the increment, and
length(input) is already assumed to be representable. The increment length(input) is already assumed to be representable. The increment
of n could overflow, but only if h == length(input), in which case of n could overflow, but only if h == length(input), in which case
the procedure is finished anyway. the procedure is finished anyway.
6.4 Alternative methods for handling overflow 6.4 Alternative methods for handling overflow
skipping to change at line 547 skipping to change at line 590
if integer variables were capable of representing values that large. if integer variables were capable of representing values that large.
This prevention approach would impose more restrictions on the input This prevention approach would impose more restrictions on the input
than the detection approach does, but might be considered simpler in than the detection approach does, but might be considered simpler in
some programming languages. some programming languages.
In theory, the decoder could use an analogous approach, limiting the In theory, the decoder could use an analogous approach, limiting the
number of digits in a variable-length integer (that is, limiting the number of digits in a variable-length integer (that is, limiting the
number of iterations in the innermost loop). However, the number number of iterations in the innermost loop). However, the number
of digits that suffice to represent a given delta can sometimes of digits that suffice to represent a given delta can sometimes
represent much larger deltas (because of the adaptation), and hence represent much larger deltas (because of the adaptation), and hence
this approach would probably require integers wider than 32 bits. this approach would probably need integers wider than 32 bits.
Yet another approach for the decoder is to allow overflow to occur, Yet another approach for the decoder is to allow overflow to occur,
but to check the final output string by re-encoding it and comparing but to check the final output string by re-encoding it and comparing
to the decoder input. If and only if they do not match (using a to the decoder input. If and only if they do not match (using a
case-insensitive ASCII comparison) overflow has occurred. This case-insensitive ASCII comparison) overflow has occurred. This
delayed-detection approach would not impose any more restrictions on delayed-detection approach would not impose any more restrictions on
the input than the immediate-detection approach does, and might be the input than the immediate-detection approach does, and might be
considered simpler in some programming languages. considered simpler in some programming languages.
In fact, if the decoder is used only inside the IDNA ToUnicode In fact, if the decoder is used only inside the IDNA ToUnicode
operation [IDNA], then it need not check for overflow at all, operation [IDNA], then it need not check for overflow at all,
because ToUnicode performs a higher level re-encoding and because ToUnicode performs a higher level re-encoding and
comparison, and a mismatch has the same consequence as if the comparison, and a mismatch has the same consequence as if the
Punycode decoder had failed. Punycode decoder had failed.
7. Punycode example strings 7. Punycode examples
7.1 Sample strings
In the Punycode encodings below, the IDNA signature prefix is not In the Punycode encodings below, the IDNA signature prefix is not
shown. Backslashes show where line breaks have been inserted in shown. Backslashes show where line breaks have been inserted in
strings too long for one line. strings too long for one line.
The first several examples are all translations of the sentence "Why The first several examples are all translations of the sentence "Why
can't they just speak in <language>?" (courtesy of Michael Kaplan's can't they just speak in <language>?" (courtesy of Michael Kaplan's
"provincial" page [PROVINCIAL]). Word breaks and punctuation have "provincial" page [PROVINCIAL]). Word breaks and punctuation have
been removed, as is often done in domain names. been removed, as is often done in domain names.
skipping to change at line 680 skipping to change at line 725
Punycode: MajiKoi5-783gue6qz075azm5e Punycode: MajiKoi5-783gue6qz075azm5e
(Q) <pafii>de<runba> (Q) <pafii>de<runba>
u+30D1 u+30D5 u+30A3 u+30FC u+0064 u+0065 u+30EB u+30F3 u+30D0 u+30D1 u+30D5 u+30A3 u+30FC u+0064 u+0065 u+30EB u+30F3 u+30D0
Punycode: de-jg4avhby1noc0d Punycode: de-jg4avhby1noc0d
(R) <sono><supiido><de> (R) <sono><supiido><de>
u+305D u+306E u+30B9 u+30D4 u+30FC u+30C9 u+3067 u+305D u+306E u+30B9 u+30D4 u+30FC u+30C9 u+3067
Punycode: d9juau41awczczp Punycode: d9juau41awczczp
The last example is an ASCII string that breaks not only the The last example is an ASCII string that breaks not only the
existing rules for host name labels but also the rules proposed in existing rules for host name labels but also the rules in [NAMEPREP]
[NAMEPREP] for internationalized domain names. for internationalized domain names.
(S) -> $1.00 <- (S) -> $1.00 <-
u+002D u+003E u+0020 u+0024 u+0031 u+002E u+0030 u+0030 u+0020 u+002D u+003E u+0020 u+0024 u+0031 u+002E u+0030 u+0030 u+0020
u+003C u+002D u+003C u+002D
Punycode: -> $1.00 <-- Punycode: -> $1.00 <--
7.2 Decoding traces
In the following traces, the evolving state of the decoder is
shown as a sequence of hexadecimal values, representing the code
points in the extended string. An asterisk appears just after the
most recently inserted code point, indicating both n (the value
preceeding the asterisk) and i (the position of the value just after
the asterisk). Other numerical values are decimal.
Decoding trace of example B from section 7.1:
n is 128, i is 0, bias is 72
input is "ihqwcrb4cv8a8dqg056pqjye"
there is no delimiter, so extended string starts empty
delta "ihq" decodes to 19853
bias becomes 21
4E0D *
delta "wc" decodes to 64
bias becomes 20
4E0D 4E2D *
delta "rb" decodes to 37
bias becomes 13
4E3A * 4E0D 4E2D
delta "4c" decodes to 56
bias becomes 17
4E3A 4E48 * 4E0D 4E2D
delta "v8a" decodes to 599
bias becomes 32
4E3A 4EC0 * 4E48 4E0D 4E2D
delta "8d" decodes to 130
bias becomes 23
4ED6 * 4E3A 4EC0 4E48 4E0D 4E2D
delta "qg" decodes to 154
bias becomes 25
4ED6 4EEC * 4E3A 4EC0 4E48 4E0D 4E2D
delta "056p" decodes to 46301
bias becomes 84
4ED6 4EEC 4E3A 4EC0 4E48 4E0D 4E2D 6587 *
delta "qjye" decodes to 88531
bias becomes 90
4ED6 4EEC 4E3A 4EC0 4E48 4E0D 8BF4 * 4E2D 6587
Decoding trace of example L from section 7.1:
n is 128, i is 0, bias is 72
input is "3B-ww4c5e180e575a65lsy2b"
literal portion is "3B-", so extended string starts as:
0033 0042
delta "ww4c" decodes to 62042
bias becomes 27
0033 0042 5148 *
delta "5e" decodes to 139
bias becomes 24
0033 0042 516B * 5148
delta "180e" decodes to 16683
bias becomes 67
0033 5E74 * 0042 516B 5148
delta "575a" decodes to 34821
bias becomes 82
0033 5E74 0042 516B 5148 751F *
delta "65l" decodes to 14592
bias becomes 67
0033 5E74 0042 7D44 * 516B 5148 751F
delta "sy2b" decodes to 42088
bias becomes 84
0033 5E74 0042 7D44 91D1 * 516B 5148 751F
7.3 Encoding traces
In the following traces, code point values are hexadecimal, while
other numerical values are decimal.
Encoding trace of example B from section 7.1:
bias is 72
input is:
4ED6 4EEC 4E3A 4EC0 4E48 4E0D 8BF4 4E2D 6587
there are no basic code points, so no literal portion
next code point to insert is 4E0D
needed delta is 19853, encodes as "ihq"
bias becomes 21
next code point to insert is 4E2D
needed delta is 64, encodes as "wc"
bias becomes 20
next code point to insert is 4E3A
needed delta is 37, encodes as "rb"
bias becomes 13
next code point to insert is 4E48
needed delta is 56, encodes as "4c"
bias becomes 17
next code point to insert is 4EC0
needed delta is 599, encodes as "v8a"
bias becomes 32
next code point to insert is 4ED6
needed delta is 130, encodes as "8d"
bias becomes 23
next code point to insert is 4EEC
needed delta is 154, encodes as "qg"
bias becomes 25
next code point to insert is 6587
needed delta is 46301, encodes as "056p"
bias becomes 84
next code point to insert is 8BF4
needed delta is 88531, encodes as "qjye"
bias becomes 90
output is "ihqwcrb4cv8a8dqg056pqjye"
Encoding trace of example L from section 7.1:
bias is 72
input is:
0033 5E74 0042 7D44 91D1 516B 5148 751F
basic code points (0033, 0042) are copied to literal portion: "3B-"
next code point to insert is 5148
needed delta is 62042, encodes as "ww4c"
bias becomes 27
next code point to insert is 516B
needed delta is 139, encodes as "5e"
bias becomes 24
next code point to insert is 5E74
needed delta is 16683, encodes as "180e"
bias becomes 67
next code point to insert is 751F
needed delta is 34821, encodes as "575a"
bias becomes 82
next code point to insert is 7D44
needed delta is 14592, encodes as "65l"
bias becomes 67
next code point to insert is 91D1
needed delta is 42088, encodes as "sy2b"
bias becomes 84
output is "3B-ww4c5e180e575a65lsy2b"
8. Security considerations 8. Security considerations
Users expect each domain name in DNS to be controlled by a single Users expect each domain name in DNS to be controlled by a single
authority. If a Unicode string intended for use as a domain label authority. If a Unicode string intended for use as a domain label
could map to multiple ACE labels, then an internationalized domain could map to multiple ACE labels, then an internationalized domain
name could map to multiple ASCII domain names, each controlled by name could map to multiple ASCII domain names, each controlled by
a different authority, some of which could be spoofs that hijack a different authority, some of which could be spoofs that hijack
service requests intended for another. Therefore Punycode is service requests intended for another. Therefore Punycode is
designed so that each Unicode string has a unique encoding. designed so that each Unicode string has a unique encoding.
However, there can still be multiple Unicode representations of the However, there can still be multiple Unicode representations of the
"same" text, for various definitions of "same". This problem is "same" text, for various definitions of "same". This problem is
addressed to some extent by the Unicode standard under the topic of addressed to some extent by the Unicode standard under the topic of
canonicalization, and this work is leveraged for domain names by canonicalization, and this work is leveraged for domain names by
"nameprep" [NAMEPREP]. Nameprep [NAMEPREP].
9. References 9. References
[IDN] Internationalized Domain Names (IETF working group),
http://www.i-d-n.net/, idn@ops.ietf.org.
[IDNA] Patrik Faltstrom, Paul Hoffman, Adam M. Costello, [IDNA] Patrik Faltstrom, Paul Hoffman, Adam M. Costello,
"Internationalizing Host Names In Applications (IDNA)", 2001-Nov-19, "Internationalizing Domain Names In Applications (IDNA)",
draft-ietf-idn-idna-05. 2002-###-##, draft-ietf-idn-idna-07.
[NAMEPREP] Paul Hoffman, Marc Blanchet, "Stringprep [NAMEPREP] Paul Hoffman, Marc Blanchet, "Nameprep: A Stringprep
Profile for Internationalized Host Names", 2001-Sep-27, Profile for Internationalized Domain Names", 2002-###-##,
draft-ietf-idn-nameprep-06. draft-ietf-idn-nameprep-08.
[PROVINCIAL] Michael Kaplan, "The 'anyone can be provincial!' page", [PROVINCIAL] Michael Kaplan, "The 'anyone can be provincial!' page",
http://www.trigeminal.com/samples/provincial.html. http://www.trigeminal.com/samples/provincial.html.
[RFC952] K. Harrenstien, M. Stahl, E. Feinler, "DOD Internet Host [RFC952] K. Harrenstien, M. Stahl, E. Feinler, "DOD Internet Host
Table Specification", 1985-Oct, RFC 952. Table Specification", 1985-Oct, RFC 952.
[RFC1034] P. Mockapetris, "Domain Names - Concepts and Facilities", [RFC1034] P. Mockapetris, "Domain Names - Concepts and Facilities",
1987-Nov, RFC 1034. 1987-Nov, RFC 1034.
skipping to change at line 739 skipping to change at line 912
A. Author contact information A. Author contact information
Adam M. Costello Adam M. Costello
University of California, Berkeley University of California, Berkeley
http://www.nicemice.net/amc/ http://www.nicemice.net/amc/
B. Mixed-case annotation B. Mixed-case annotation
In order to use Punycode to represent case-insensitive strings, In order to use Punycode to represent case-insensitive strings,
higher layers need to case-fold the strings prior to Punycode higher layers need to case-fold the strings prior to Punycode
encoding. The encoded string can, however, use mixed case as an encoding. The encoded string can use mixed case as an annotation
annotation telling how to convert the original folded string into a telling how to convert the folded string into a mixed-case string
mixed-case string for display purposes. for display purposes. Note, however, that mixed-case annotation
is not used by the ToASCII and ToUnicode operations specified in
[IDNA], and therefore implementors of IDNA can disregard this
appendix.
Basic code points can use mixed case directly, because the decoder Basic code points can use mixed case directly, because the decoder
copies them verbatim, leaving lowercase code points lowercase, and copies them verbatim, leaving lowercase code points lowercase, and
leaving uppercase code points uppercase. Each non-basic code point leaving uppercase code points uppercase. Each non-basic code point
is represented by a delta, which is represented by a sequence of is represented by a delta, which is represented by a sequence of
basic code points, the last of which provides the annotation. If it basic code points, the last of which provides the annotation. If it
is uppercase, it is a suggestion to map the non-basic code point to is uppercase, it is a suggestion to map the non-basic code point to
uppercase (if possible); if it is lowercase, it is a suggestion to uppercase (if possible); if it is lowercase, it is a suggestion to
map the non-basic code point to lowercase (if possible). map the non-basic code point to lowercase (if possible).
Punycode encoders and decoders are not required to support these These annotations do not alter the code points returned by decoders;
annotations, and higher layers need not use them. the annotations are returned separately, for the caller to use or
ignore. Encoders can accept annotations in addition to code points,
but the annotations do not alter the output, except to influence the
uppercase/lowercase form of ASCII letters.
C. Disclaimer and License Punycode encoders and decoders need not support these annotations,
and higher layers need not use them.
C. Disclaimer and license
Regarding this entire document or any portion of it (including Regarding this entire document or any portion of it (including
the pseudocode and C code), the author makes no guarantees and the pseudocode and C code), the author makes no guarantees and
is not responsible for any damage resulting from its use. The is not responsible for any damage resulting from its use. The
author grants irrevocable permission to anyone to use, modify, author grants irrevocable permission to anyone to use, modify,
and distribute it in any way that does not diminish the rights and distribute it in any way that does not diminish the rights
of anyone else to use, modify, and distribute it, provided that of anyone else to use, modify, and distribute it, provided that
redistributed derivative works do not contain misleading author or redistributed derivative works do not contain misleading author or
version information. Derivative works need not be licensed under version information. Derivative works need not be licensed under
similar terms. similar terms.
D. Punycode sample implementation D. Punycode sample implementation
/* /*
punycode.c 0.4.0 (2001-Nov-17-Sat) punycode.c from draft-ietf-idn-punycode-01
http://www.cs.berkeley.edu/~amc/idn/ http://www.nicemice.net/idn/
Adam M. Costello Adam M. Costello
http://www.nicemice.net/amc/ http://www.nicemice.net/amc/
*/
/* This is ANSI C code (C89) implementing Punycode version 0.3.x. */ This is ANSI C code (C89) implementing
Punycode (draft-ietf-idn-punycode-01).
*/
/************************************************************/ /************************************************************/
/* Public interface (would normally go in its own .h file): */ /* Public interface (would normally go in its own .h file): */
#include <limits.h> #include <limits.h>
enum punycode_status { enum punycode_status {
punycode_success, punycode_success,
punycode_bad_input, /* Input is invalid. */ punycode_bad_input, /* Input is invalid. */
punycode_big_output, /* Output would exceed the space provided. */ punycode_big_output, /* Output would exceed the space provided. */
punycode_overflow /* Input requires wider integers to process. */ punycode_overflow /* Input needs wider integers to process. */
}; };
#if UINT_MAX >= (1 << 26) - 1 #if UINT_MAX >= (1 << 26) - 1
typedef unsigned int punycode_uint; typedef unsigned int punycode_uint;
#else #else
typedef unsigned long punycode_uint; typedef unsigned long punycode_uint;
#endif #endif
enum punycode_status punycode_encode( enum punycode_status punycode_encode(
punycode_uint input_length, punycode_uint input_length,
skipping to change at line 802 skipping to change at line 986
#else #else
typedef unsigned long punycode_uint; typedef unsigned long punycode_uint;
#endif #endif
enum punycode_status punycode_encode( enum punycode_status punycode_encode(
punycode_uint input_length, punycode_uint input_length,
const punycode_uint input[], const punycode_uint input[],
const unsigned char case_flags[], const unsigned char case_flags[],
punycode_uint *output_length, punycode_uint *output_length,
char output[] ); char output[] );
/* punycode_encode() converts Unicode to Punycode. The input */ /* punycode_encode() converts Unicode to Punycode. The input */
/* must be represented as an array of Unicode code points (not */ /* is represented as an array of Unicode code points (not code */
/* code units; surrogate pairs are not allowed), and the output */ /* units; surrogate pairs are not allowed), and the output */
/* will be represented as an array of ASCII code points. The */ /* will be represented as an array of ASCII code points. The */
/* output string is *not* null-terminated; it will contain */ /* output string is *not* null-terminated; it will contain */
/* zeros if and only if the input contains zeros. (Of course */ /* zeros if and only if the input contains zeros. (Of course */
/* the caller can leave room for a terminator and add one if */ /* the caller can leave room for a terminator and add one if */
/* needed.) The input_length is the number of code points in the */ /* needed.) The input_length is the number of code points in */
/* input. The output_length is an in/out argument: the caller */ /* the input. The output_length is an in/out argument: the */
/* must pass in the maximum number of code points that may be */ /* caller passes in the maximum number of code points that it */
/* output, and on successful return it will contain the number */ /* can receive, and on successful return it will contain the */
/* of code points actually output. The case_flags array must */ /* number of code points actually output. The case_flags array */
/* hold input_length boolean values, where nonzero means the */ /* holds input_length boolean values, where nonzero suggests that */
/* corresponding Unicode character should be forced to uppercase */ /* the corresponding Unicode character be forced to uppercase */
/* after being decoded (if possible), and zero means it should */ /* after being decoded (if possible), and zero suggests that */
/* be forced to lowercase (if possible). ASCII code points */ /* it be forced to lowercase (if possible). ASCII code points */
/* are encoded literally, except that ASCII letters are forced */ /* are encoded literally, except that ASCII letters are forced */
/* to uppercase or lowercase according to the corresponding */ /* to uppercase or lowercase according to the corresponding */
/* uppercase flags. If case_flags is a null pointer then ASCII */ /* uppercase flags. If case_flags is a null pointer then ASCII */
/* letters are left as they are, and other code points are */ /* letters are left as they are, and other code points are */
/* treated as if their uppercase flags were zero. The return */ /* treated as if their uppercase flags were zero. The return */
/* value may be any of the punycode_status values defined above */ /* value can be any of the punycode_status values defined above */
/* except punycode_bad_input; if not punycode_success, then */ /* except punycode_bad_input; if not punycode_success, then */
/* output_size and output may contain garbage. */ /* output_size and output might contain garbage. */
enum punycode_status punycode_decode( enum punycode_status punycode_decode(
punycode_uint input_length, punycode_uint input_length,
const char input[], const char input[],
punycode_uint *output_length, punycode_uint *output_length,
punycode_uint output[], punycode_uint output[],
unsigned char case_flags[] ); unsigned char case_flags[] );
/* punycode_decode() converts Punycode to Unicode. The input */
/* must be represented as an array of ASCII code points, and */ /* punycode_decode() converts Punycode to Unicode. The input is */
/* the output will be represented as an array of Unicode code */ /* represented as an array of ASCII code points, and the output */
/* points. The input_length is the number of code points in */ /* will be represented as an array of Unicode code points. The */
/* the input. The output_length is an in/out argument: the */ /* input_length is the number of code points in the input. The */
/* caller must pass in the maximum number of code points that */ /* output_length is an in/out argument: the caller passes in */
/* may be output, and on successful return it will contain the */ /* the maximum number of code points that it can receive, and */
/* actual number of code points output. The case_flags array */ /* on successful return it will contain the actual number of */
/* must have room for at least output_length values, or it may */ /* code points output. The case_flags array needs room for at */
/* be a null pointer if the case information is not needed. */ /* least output_length values, or it can be a null pointer if the */
/* A nonzero flag indicates that the corresponding Unicode */ /* case information is not needed. A nonzero flag suggests that */
/* character should be forced to uppercase by the caller (if */ /* the corresponding Unicode character be forced to uppercase */
/* possible), while zero means it should be forced to lowercase */ /* by the caller (if possible), while zero suggests that it be */
/* (if possible). ASCII code points are output already in the */ /* forced to lowercase (if possible). ASCII code points are */
/* proper case, but their flags will be set appropriately so that */ /* output already in the proper case, but their flags will be set */
/* applying the flags would be harmless. The return value may */ /* appropriately so that applying the flags would be harmless. */
/* be any of the punycode_status values defined above; if not */ /* The return value can be any of the punycode_status values */
/* punycode_success, then output_length, output, and case_flags */ /* defined above; if not punycode_success, then output_length, */
/* may contain garbage. On success, the decoder will never need */ /* output, and case_flags might contain garbage. On success, the */
/* to write an output_length greater than input_length, because */ /* decoder will never need to write an output_length greater than */
/* of how the encoding is defined. */ /* input_length, because of how the encoding is defined. */
/**********************************************************/ /**********************************************************/
/* Implementation (would normally go in its own .c file): */ /* Implementation (would normally go in its own .c file): */
#include <string.h> #include <string.h>
/*** Bootstring parameters for Punycode ***/ /*** Bootstring parameters for Punycode ***/
enum { base = 36, tmin = 1, tmax = 26, skew = 38, damp = 700, enum { base = 36, tmin = 1, tmax = 26, skew = 38, damp = 700,
initial_bias = 72, initial_n = 0x80, delimiter = 0x2D }; initial_bias = 72, initial_n = 0x80, delimiter = 0x2D };
skipping to change at line 883 skipping to change at line 1066
/* point (for use in representing integers) in the range 0 to */ /* point (for use in representing integers) in the range 0 to */
/* base-1, or base if cp is does not represent a value. */ /* base-1, or base if cp is does not represent a value. */
static punycode_uint decode_digit(punycode_uint cp) static punycode_uint decode_digit(punycode_uint cp)
{ {
return cp - 48 < 10 ? cp - 22 : cp - 65 < 26 ? cp - 65 : return cp - 48 < 10 ? cp - 22 : cp - 65 < 26 ? cp - 65 :
cp - 97 < 26 ? cp - 97 : base; cp - 97 < 26 ? cp - 97 : base;
} }
/* encode_digit(d,flag) returns the basic code point whose value */ /* encode_digit(d,flag) returns the basic code point whose value */
/* (when used for representing integers) is d, which must be in the */ /* (when used for representing integers) is d, which needs to be in */
/* range 0 to base-1. The lowercase form is used unless flag is */ /* the range 0 to base-1. The lowercase form is used unless flag is */
/* nonzero, in which case the uppercase form is used. The behavior */ /* nonzero, in which case the uppercase form is used. The behavior */
/* is undefined if flag is nonzero and digit d has no uppercase form. */ /* is undefined if flag is nonzero and digit d has no uppercase form. */
static char encode_digit(punycode_uint d, int flag) static char encode_digit(punycode_uint d, int flag)
{ {
return d + 22 + 75 * (d < 26) - ((flag != 0) << 5); return d + 22 + 75 * (d < 26) - ((flag != 0) << 5);
/* 0..25 map to ASCII a..z or A..Z */ /* 0..25 map to ASCII a..z or A..Z */
/* 26..35 map to ASCII 0..9 */ /* 26..35 map to ASCII 0..9 */
} }
skipping to change at line 1003 skipping to change at line 1187
/* Punycode does not need to check whether input[j] is basic: */ /* Punycode does not need to check whether input[j] is basic: */
if (input[j] < n /* || basic(input[j]) */ ) { if (input[j] < n /* || basic(input[j]) */ ) {
if (++delta == 0) return punycode_overflow; if (++delta == 0) return punycode_overflow;
} }
if (input[j] == n) { if (input[j] == n) {
/* Represent delta as a generalized variable-length integer: */ /* Represent delta as a generalized variable-length integer: */
for (q = delta, k = base; ; k += base) { for (q = delta, k = base; ; k += base) {
if (out >= max_out) return punycode_big_output; if (out >= max_out) return punycode_big_output;
t = k <= bias ? tmin : k - bias >= tmax ? tmax : k - bias; t = k <= bias /* + tmin */ ? tmin : /* +tmin not needed */
k >= bias + tmax ? tmax : k - bias;
if (q < t) break; if (q < t) break;
output[out++] = encode_digit(t + (q - t) % (base - t), 0); output[out++] = encode_digit(t + (q - t) % (base - t), 0);
q = (q - t) / (base - t); q = (q - t) / (base - t);
} }
output[out++] = encode_digit(q, case_flags && case_flags[j]); output[out++] = encode_digit(q, case_flags && case_flags[j]);
bias = adapt(delta, h + 1, h == b); bias = adapt(delta, h + 1, h == b);
delta = 0; delta = 0;
++h; ++h;
} }
skipping to change at line 1073 skipping to change at line 1259
/* which gets added to i. The overflow checking is easier */ /* which gets added to i. The overflow checking is easier */
/* if we increase i as we go, then subtract off its starting */ /* if we increase i as we go, then subtract off its starting */
/* value at the end to obtain delta. */ /* value at the end to obtain delta. */
for (oldi = i, w = 1, k = base; ; k += base) { for (oldi = i, w = 1, k = base; ; k += base) {
if (in >= input_length) return punycode_bad_input; if (in >= input_length) return punycode_bad_input;
digit = decode_digit(input[in++]); digit = decode_digit(input[in++]);
if (digit >= base) return punycode_bad_input; if (digit >= base) return punycode_bad_input;
if (digit > (maxint - i) / w) return punycode_overflow; if (digit > (maxint - i) / w) return punycode_overflow;
i += digit * w; i += digit * w;
t = k <= bias ? tmin : k - bias >= tmax ? tmax : k - bias; t = k <= bias /* + tmin */ ? tmin : /* +tmin not needed */
k >= bias + tmax ? tmax : k - bias;
if (digit < t) break; if (digit < t) break;
if (w > maxint / (base - t)) return punycode_overflow; if (w > maxint / (base - t)) return punycode_overflow;
w *= (base - t); w *= (base - t);
} }
bias = adapt(i - oldi, out + 1, oldi == 0); bias = adapt(i - oldi, out + 1, oldi == 0);
/* i was supposed to wrap around from out+1 to 0, */ /* i was supposed to wrap around from out+1 to 0, */
/* incrementing n each time, so we'll fix that now: */ /* incrementing n each time, so we'll fix that now: */
if (i / (out + 1) > maxint - n) return punycode_overflow; if (i / (out + 1) > maxint - n) return punycode_overflow;
n += i / (out + 1); n += i / (out + 1);
i %= (out + 1); i %= (out + 1);
/* Insert n at position i of the output: */ /* Insert n at position i of the output: */
skipping to change at line 1135 skipping to change at line 1322
{ {
fprintf(stderr, fprintf(stderr,
"\n" "\n"
"%s -e reads code points and writes a Punycode string.\n" "%s -e reads code points and writes a Punycode string.\n"
"%s -d reads a Punycode string and writes code points.\n" "%s -d reads a Punycode string and writes code points.\n"
"\n" "\n"
"Input and output are plain text in the native character set.\n" "Input and output are plain text in the native character set.\n"
"Code points are in the form u+hex separated by whitespace.\n" "Code points are in the form u+hex separated by whitespace.\n"
"Although the specification allows Punycode strings to contain\n" "Although the specification allows Punycode strings to contain\n"
"any characters from the ASCII repertoire, this test code\n" "any characters from the ASCII repertoire, this test code\n"
"supports only the printable characters, and requires the\n" "supports only the printable characters, and needs the Punycode\n"
"Punycode string to be followed by a newline.\n" "string to be followed by a newline.\n"
"The case of the u in u+hex is the force-to-uppercase flag.\n" "The case of the u in u+hex is the force-to-uppercase flag.\n"
, argv[0], argv[0]); , argv[0], argv[0]);
exit(EXIT_FAILURE); exit(EXIT_FAILURE);
} }
static void fail(const char *msg) static void fail(const char *msg)
{ {
fputs(msg,stderr); fputs(msg,stderr);
exit(EXIT_FAILURE); exit(EXIT_FAILURE);
} }
skipping to change at line 1274 skipping to change at line 1461
if (r < 0) fail(io_error); if (r < 0) fail(io_error);
} }
return EXIT_SUCCESS; return EXIT_SUCCESS;
} }
usage(argv); usage(argv);
return EXIT_SUCCESS; /* not reached, but quiets compiler warning */ return EXIT_SUCCESS; /* not reached, but quiets compiler warning */
} }
INTERNET-DRAFT expires 2002-Jul-06 INTERNET-DRAFT expires 2002-Aug-24
 End of changes. 65 change blocks. 
176 lines changed or deleted 359 lines changed or added

This html diff was produced by rfcdiff 1.33. The latest version is available from http://tools.ietf.org/tools/rfcdiff/