 1/draftietfidnpunycode00.txt 20061203 11:34:43.000000000 +0100
+++ 2/draftietfidnpunycode01.txt 20061203 11:34:43.000000000 +0100
@@ 1,15 +1,15 @@
INTERNETDRAFT Adam M. Costello
draftietfidnpunycode00.txt 2002Jan06
Expires 2002Jul06
+draftietfidnpunycode01.txt 2002Feb24
+Expires 2002Aug24
 Punycode version 0.3.3
+ Punycode: An encoding of Unicode for use with IDNA
Status of this Memo
This document is an InternetDraft and is in full conformance with
all provisions of Section 10 of RFC2026.
InternetDrafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note
that other groups may also distribute working documents as
InternetDrafts.
@@ 18,109 +18,118 @@
months and may be updated, replaced, or obsoleted by other documents
at any time. It is inappropriate to use InternetDrafts as
reference material or to cite them other than as "work in progress."
The list of current InternetDrafts can be accessed at
http://www.ietf.org/ietf/1idabstracts.txt
The list of InternetDraft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html
 Distribution of this document is unlimited. Please send comments
 to the author at amc@cs.berkeley.edu, or to the idn working
 group at idn@ops.ietf.org. A nonpaginated (and possibly
 newer) version of this specification may be available at
 http://www.cs.berkeley.edu/~amc/idn/
+ Distribution of this document is unlimited.
Abstract
 Punycode is a simple and efficient encoding designed for use with
 Internationalized Domain Names [IDN] [IDNA]. It uniquely and
 reversibly transforms a Unicode string [UNICODE] into an ASCII
 string. ASCII characters in the Unicode string are represented
 literally, and nonASCII characters are represented by ASCII
 characters that are allowed in hostname labels (letters, digits,
 and hyphens). Bootstring is a general algorithm that allows a
 string of basic code points to uniquely represent any string of code
 points drawn from a larger set. Punycode is an instance Bootstring
 that uses particular parameter values appropriate for IDNA. This
 document specifies Bootstring and the parameter values for Punycode.
+ Punycode is a simple and efficient transfer encoding syntax designed
+ for use with Internationalized Domain Names in Applications [IDNA].
+ It uniquely and reversibly transforms a Unicode string [UNICODE]
+ into an ASCII string. ASCII characters in the Unicode string are
+ represented literally, and nonASCII characters are represented by
+ ASCII characters that are allowed in host name labels (letters,
+ digits, and hyphens). Bootstring is a general algorithm that allows
+ a string of basic code points to uniquely represent any string of
+ code points drawn from a larger set. Punycode is an instance of
+ Bootstring that uses particular parameter values appropriate for
+ IDNA. This document specifies Bootstring and the parameter values
+ for Punycode.
Contents
1. Introduction
+ 1.1 Features
+ 1.2 Interaction of protocol parts
2. Terminology
3. Bootstring description
3.1 Basic code point segregation
3.2 Insertion unsort coding
3.3 Generalized variablelength integers
3.4 Bias adaptation
4. Bootstring parameters
5. Parameter values for Punycode
6. Bootstring algorithms
6.1 Bias adaptation function
6.2 Decoding procedure
6.3 Encoding procedure
6.4 Alternative methods for handling overflow
 7. Punycode example strings
+ 7. Punycode examples
+ 7.1 Sample strings
+ 7.2 Decoding traces
+ 7.3 Encoding traces
8. Security considerations
9. References
A. Author contact information
B. Mixedcase annotation
 C. Disclaimer and License
+ C. Disclaimer and license
D. Punycode sample implementation
1. Introduction
 The IDNA draft [IDNA] describes an architecture for supporting
 internationalized domain names. Labels containing nonASCII
 characters can be represented by ACE labels, which begin with a
 special prefix and contain only ASCII characters. The remainder
 of the label after the prefix is a Punycode encoding of a Unicode
 string satisfying certain constraints. For the details of the
 prefix and constraints, see [IDNA] and [NAMEPREP].
+ [IDNA] describes an architecture for supporting internationalized
+ domain names. Labels containing nonASCII characters can be
+ represented by ACE labels, which begin with a special prefix and
+ contain only ASCII characters. The remainder of the label after the
+ prefix is a Punycode encoding of a Unicode string satisfying certain
+ constraints. For the details of the prefix and constraints, see
+ [IDNA] and [NAMEPREP].
+
+1.1 Features
Bootstring has been designed to have the following features:
* Completeness: Every extended string (sequence of arbitrary code
points) can be represented by a basic string (sequence of basic
code points). Restrictions on what strings are allowed, and on
 length, may be imposed by higher layers.
+ length, can be imposed by higher layers.
* Uniqueness: There is at most one basic string that represents a
given extended string.
* Reversibility: Any extended string mapped to a basic string can
be recovered from that basic string.
 * Efficient encoding: The ratio of extended string length to
 basic string length is small. This is important in the context
 of domain names because RFC 1034 [RFC1034] restricts the length
 of a domain label to 63 characters.
+ * Efficient encoding: The ratio of basic string length to
+ extended string length is small. This is important in the
+ context of domain names because RFC 1034 [RFC1034] restricts the
+ length of a domain label to 63 characters.
* Simplicity: The encoding and decoding algorithms are reasonably
simple to implement. The goals of efficiency and simplicity are
at odds; Bootstring aims at a good balance between them.
 * Readability: Basic code points appearing in the extended
 string are represented as themselves in the basic string. This
 comes for free because it makes the encoding more efficient on
 average.
+ * Readability: Basic code points appearing in the extended string
+ are represented as themselves in the basic string (although the
+ main purpose is to improve efficiency, not readability).
 In addition, Punycode can support an optional feature described in
+ Punycode can also support an additional feature described in
appendix B "Mixedcase annotation".
+1.2 Interaction of protocol parts
+
+ Punycode is used by the IDNA protocol [IDNA] for converting domain
+ labels into ASCII; it is not designed for any other purpose. It is
+ explicitly not designed for processing arbitrary free text.
+
2. Terminology
 The key words "must", "shall", "required", "should", "recommended",
 and "may" in this document are to be interpreted as described in RFC
 2119 [RFC2119].
+ The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
+ "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
+ document are to be interpreted as described in RFC 2119 [RFC2119].
A code point is an integral value associated with a character in a
coded character set.
As in the Unicode Standard [UNICODE], Unicode code points are
denoted by "U+" followed by four to six hexadecimal digits, while a
range of code points is denoted by two hexadecimal numbers separated
by "..", with no prefixes.
The operators div and mod perform integer division; (x div y) is the
@@ 130,87 +139,112 @@
and remainder are always nonnegative.
The break statement jumps out of the innermost loop (as in C).
An overflow is an attempt to compute a value that exceeds the
maximum value of an integer variable.
3. Bootstring description
Bootstring represents an arbitrary sequence of code points (the
 "extended string") as a sequence of basic code points (the
 "basic string"). This section describes the representation.
 Section 6 "Bootstring algorithms" presents the algorithms as
 pseudocode.
+ "extended string") as a sequence of basic code points (the "basic
+ string"). This section describes the representation. Section 6
+ "Bootstring algorithms" presents the algorithms as pseudocode.
+ Sections 7.1 "Decoding traces" and 7.2 "Encoding traces" trace the
+ algorithms for sample inputs.
+
+ The following sections describe the four techniques used in
+ Bootstring. "Basic code point segregation" is a very simple
+ and efficient encoding for basic code points occurring in the
+ extended string: they are simply copied all at once. "Insertion
+ unsort coding" encodes the nonbasic code points as deltas, and
+ processes the code points in numerical order rather than in order of
+ appearance, which typically results in smaller deltas. The deltas
+ are represented as "generalized variablelength integers", which use
+ basic code points to represent nonnegative integers. The parameters
+ of this integer representation are dynamically adjusted using "bias
+ adaptation", to improve efficiency when consecutive deltas have
+ similar magnitudes.
3.1 Basic code point segregation
All basic code points appearing in the extended string are
represented literally at the beginning of the basic string, in their
original order, followed by a delimiter if (and only if) the number
of basic code points is nonzero. The delimiter is a particular
basic code point, which never appears in the remainder of the basic
string. The decoder can therefore find the end of the literal
portion (if there is one) by scanning for the last delimiter.
3.2 Insertion unsort coding
The remainder of the basic string (after the last delimiter if there
is one) represents a sequence of nonnegative integral deltas as
generalized variablelength integers, described in section 3.3. The
meaning of the deltas is best understood in terms of the decoder.
The decoder builds the extended string incrementally. Initially,
the extended string is a copy of the literal portion of the basic
 string (excluding the last delimiter). Each delta causes the
 decoder to insert a code point into the extended string according
 to the following procedure. There are two state variables: a
 code point n, and an index i that ranges from zero (which is the
 first position of the extended string) to the current length of
 the extended string (which refers to a potential position beyond
 the current end). The decoder advances the state monotonically
 (never returning to an earlier state) by taking steps only upward.
 Each step increments i, except when i already equals the length
 of the extended string, in which case a step resets i to zero
 and increments n. For each delta (in order), the decoder takes
 delta steps upward, then inserts the value n into the extended
 string at position i, then increments i (to skip over the code
 point just inserted). (An implementation should not take each
 step individually, but should insead use division and remainder
 calculations to advance by delta steps all at once.) It is an error
 if the inserted code point is a basic code point (because basic code
 points must be segregated as described in section 3.1).
+ string (excluding the last delimiter). The decoder inserts
+ nonbasic code points, one for each delta, into the extended string,
+ ultimately arriving at the final decoded string.
+
+ At the heart of this process is a state machine with two state
+ variables: an index i and a counter n. The index i refers to
+ a position in the extended string; it ranges from 0 (the first
+ position) to the current length of the extended string (which refers
+ to a potential position beyond the current end). If the current
+ state is , the next state is if i is less than the
+ length of the extended string, or if i equals the length of
+ the extended string. In other words, each state change causes i to
+ increment, wrapping around to zero if necessary, and n counts the
+ number of wraparounds.
+
+ Notice that the state always advances monotonically (there is no
+ way for the decoder to return to an earlier state). At each state,
+ an insertion is either performed or not performed. At most one
+ insertion is performed in a given state. An insertion inserts the
+ value of n at position i in the extended string. The deltas are
+ a runlength encoding of this sequence of events: they are the
+ lengths of the runs of noninsertion states preceeding the insertion
+ states. Hence, for each delta, the decoder performs delta state
+ changes, then an insertion, and then one more state change. (An
+ implementation need not perform each state change individually, but
+ can instead use division and remainder calculations to compute the
+ next insertion state directly.) It is an error if the inserted code
+ point is a basic code point (because basic code points were supposed
+ to be segregated as described in section 3.1).
The encoder's main task is to derive the sequence of deltas that
will cause the decoder to construct the desired string. It can do
this by repeatedly scanning the extended string for the next code
point that the decoder would need to insert, and counting the number
 of steps the decoder would need to take, mindful of the fact that
 the decoder will be stepping over only those code points that have
 already been inserted. Section 6.3 "Encoding procedure" gives a
 precise algorithm.
+ of state changes the decoder would need to perform, mindful of the
+ fact that the decoder's extended string will include only those
+ code points that have already been inserted. Section 6.3 "Encoding
+ procedure" gives a precise algorithm.
3.3 Generalized variablelength integers
In a conventional integer representation the base is the number of
distinct symbols for digits, whose values are 0 through base1. Let
digit_0 denote the least significant digit, digit_1 the next least
significant, and so on. The value represented is the sum over j of
digit_j * w(j), where w(j) = base^j is the weight (scale factor)
for position j. For example, in the base 8 integer 437, the digits
are 7, 3, and 4, and the weights are 1, 8, and 64, so the value is
7 + 3*8 + 4*64 = 287. This representation has two disadvantages:
First, there are multiple encodings of each value (because there
 can be extra zeros in the most significant positions), which
 is inconvenient when unique encodings are required. Second,
 the integer is not selfdelimiting, so if multiple integers are
 concatenated the boundaries between them are lost.
+ can be extra zeros in the most significant positions), which is
+ inconvenient when unique encodings are needed. Second, the integer
+ is not selfdelimiting, so if multiple integers are concatenated the
+ boundaries between them are lost.
The generalized variablelength representation solves these two
problems. The digit values are still 0 through base1, but now
the integer is selfdelimiting by means of thresholds t(j), each
of which is in the range 0 through base1. Exactly one digit, the
most significant, satisfies digit_j < t(j). Therefore, if several
integers are concatenated, it is easy to separate them, starting
with the first if they are littleendian (least significant digit
first), or starting with the last if they are bigendian (most
significant digit first). As before, the value is the sum over j of
@@ 211,28 +245,27 @@
of which is in the range 0 through base1. Exactly one digit, the
most significant, satisfies digit_j < t(j). Therefore, if several
integers are concatenated, it is easy to separate them, starting
with the first if they are littleendian (least significant digit
first), or starting with the last if they are bigendian (most
significant digit first). As before, the value is the sum over j of
digit_j * w(j), but the weights are different:
w(0) = 1
w(j) = w(j1) * (base  t(j1)) for j > 0

For example, consider the littleendian sequence of base 8 digits
734251... Suppose the thresholds are 2, 3, 5, 5, 5, 5... This
implies that the weights are 1, 1*(82) = 6, 6*(83) = 30, 30*(85)
= 90, 90*(85) = 270, and so on. 7 is not less than 2, and 3 is
 not less than 3, but 4 is less than 5, so 4 must be the last digit.
 The value of 734 is 7*1 + 3*6 + 4*30 = 145. The next integer is
 251, with value 2*1 + 5*6 + 1*30 = 62. Decoding this representation
+ not less than 3, but 4 is less than 5, so 4 is the last digit. The
+ value of 734 is 7*1 + 3*6 + 4*30 = 145. The next integer is 251,
+ with value 2*1 + 5*6 + 1*30 = 62. Decoding this representation
is very similar to decoding a conventional integer: Start with a
current value of N = 0 and a weight w = 1. Fetch the next digit d
and increase N by d * w. If d is less than the current threshold
(t) then stop, otherwise increase w by a factor of (base  t),
update t for the next position, and repeat.
Encoding this representation is similar to encoding a conventional
integer: If N < t then output one digit for N and stop, otherwise
output the digit for t + ((N  t) mod (base  t)), then replace N
with (N  t) div (base  t), update t for the next position, and
@@ 296,81 +329,83 @@
a hint about the likely size of the next delta, and so t(j) is
set to tmax for the more significant digits starting with the one
expected to be last, tmin for the less significant digits up through
the one expected to be thirdlast, and somewhere between tmin and
tmax for the digit expected to be secondlast (balancing the hope of
the expectedlast digit being unnecessary against the danger of it
being insufficient).
4. Bootstring parameters
 Given a set of basic code points, one must be designated as
 the delimiter. The base can be no greater than the number of
 distinguishable basic code points remaining. The digitvalues
 in the range 0 through base1 must be associated with distinct
+ Given a set of basic code points, one needs to be designated as
+ the delimiter. The base cannot be greater than the number of
+ distinguishable basic code points remaining. The digitvalues in
+ the range 0 through base1 need to be associated with distinct
nondelimiter basic code points. In some cases multiple code points
 must have the same digitvalue; for example, uppercase and lowercase
 versions of the same letter must be equivalent if basic strings are
 caseinsensitive.
+ need to have the same digitvalue; for example, uppercase and
+ lowercase versions of the same letter need to be equivalent if basic
+ strings are caseinsensitive.
 The initial value of n must be no greater than the minimum nonbasic
+ The initial value of n cannot be greater than the minimum nonbasic
code point that could appear in extended strings.
The remaining five parameters (tmin, tmax, skew, damp, and the
 initial value of bias) must satisfy the following constraints:
+ initial value of bias) need to satisfy the following constraints:
0 <= tmin <= tmax <= base1
skew >= 1
damp >= 2
initial_bias mod base <= base  tmin
Provided the constraints are satisfied, these five parameters affect
 efficiency but not correctness. They should be chosen empirically.
+ efficiency but not correctness. They are best chosen empirically.
If support for mixedcase annotation is desired (see appendix B),
make sure that the code points corresponding to 0 through tmax1 all
have both uppercase and lowercase forms.
5. Parameter values for Punycode
Punycode uses the following Bootstring parameter values:
base = 36
tmin = 1
tmax = 26
skew = 38
damp = 700
initial_bias = 72
 initial_n = 0x80
+ initial_n = 128 = 0x80
 In Punycode, code points are Unicode code points [UNICODE], that
 is, integers in the range 0..10FFFF, but not D800..DFFF, which are
 reserved for use by UTF16. The basic code points are the ASCII
 code points (0..7F), of which U+002D () is the delimiter, and some
 of the others have digitvalues as follows:
+ Although the only restriction Punycode imposes on the input integers
+ is that they be nonnegative, these parameters are especially
+ designed to work well with Unicode [UNICODE] code points, which
+ are integers in the range 0..10FFFF (but not D800..DFFF, which are
+ reserved for use by the UTF16 encoding of Unicode). The basic code
+ points are the ASCII code points (0..7F), of which U+002D () is the
+ delimiter, and some of the others have digitvalues as follows:
code points digitvalues
 
41..5A (AZ) = 0 to 25, respectively
61..7A (az) = 0 to 25, respectively
30..39 (09) = 26 to 35, respectively
Using hyphenminus as the delimiter implies that the encoded string
can end with a hyphenminus only if the Unicode string consists
entirely of basic code points, but IDNA forbids such strings from
being encoded. The encoded string can begin with a hyphenminus,
but IDNA prepends a prefix. Therefore IDNA using Punycode conforms
 to the RFC 952 recommendation that hostname labels neither begin nor
 end with a hyphenminus [RFC952].
+ to the RFC 952 rule that host name labels neither begin nor end with
+ a hyphenminus [RFC952].
 A decoder must recognize the letters in both uppercase and lowercase
 forms (including mixtures of both forms). An encoder should output
+ A decoder MUST recognize the letters in both uppercase and lowercase
+ forms (including mixtures of both forms). An encoder SHOULD output
only uppercase forms or only lowercase forms, unless it uses
mixedcase annotation (see appendix B).
Presumably most users will not manually write or type encoded
strings (as opposed to cutting and pasting them), but those who do
will need to be alert to the potential visual ambiguity between the
following sets of characters:
G 6
I l 1
@@ 380,26 +415,26 @@
Z 2
Such ambiguities are usually resolved by context, but in a Punycode
encoded string there is no context apparent to humans.
6. Bootstring algorithms
Some parts of the pseudocode can be omitted if the parameters
satisfy certain conditions (for which Punycode qualifies). These
parts are enclosed in {braces}, and notes immediately following the
 pseudocode explain the conditions under which they may be omitted.
+ pseudocode explain the conditions under which they can be omitted.
Formally, code points are integers, and hence the pseudocode assumes
that arithmetic operations can be performed directly on code points.
 Some actual programming languages might require explicit conversion
 between code points and integers.
+ In some programming languages, explicit conversion between code
+ points and integers might be necessary.
6.1 Bias adaptation function
function adapt(delta,numpoints,firsttime):
if firsttime then let delta = delta div damp
else let delta = delta div 2
let delta = delta + (delta div numpoints)
let k = 0
while delta > ((base  tmin) * tmax) div 2 do begin
let delta = delta div (base  tmin)
@@ 422,53 +457,54 @@
and copy them to output, fail on any nonbasic code point
if more than zero code points were consumed then consume one more
(which will be the last delimiter)
while the input is not exhausted do begin
let oldi = i
let w = 1
for k = base to infinity in steps of base do begin
consume a code point, or fail if there was none to consume
let digit = the code point's digitvalue, fail if it has none
let i = i + digit * w, fail on overflow
 let t = tmin if k <= bias, tmax if k >= bias + tmax, or
 k  bias otherwise
+ let t = tmin if k <= bias {+ tmin}, or
+ tmax if k >= bias + tmax, or k  bias otherwise
if digit < t then break
let w = w * (base  t), fail on overflow
end
let bias = adapt(i  oldi, length(output) + 1, test oldi is 0?)
let n = n + i div (length(output) + 1), fail on overflow
let i = i mod (length(output) + 1)
{if n is a basic code point then fail}
insert n into output at position i
increment i
end
 The statement enclosed in braces (checking whether n is a basic
 code point) may be omitted if initial_n exceeds all basic code
+ The full statement enclosed in braces (checking whether n is a basic
+ code point) can be omitted if initial_n exceeds all basic code
points (which is true for Punycode), because n is never less than
initial_n.
+ In the assignment of t, where t is clamped to the range tmin through
+ tmax, "+ tmin" can always be omitted. This makes the clamping
+ calculation incorrect when bias < k < bias + tmin, but that cannot
+ happen because of the way bias is computed and because of the
+ constraints on the parameters.
+
Because the decoder state can only advance monotonically, and there
is only one representation of any delta, there is therefore only
one encoded string that can represent a given sequence of integers.
The only error conditions are invalid code points, unexpected
endofinput, overflow, and basic code points encoded using deltas
instead of appearing literally. If the decoder fails on these
errors as shown above, then it cannot produce the same output for
two distinct inputs, and hence it need not reencode its output to
verify that it matches the input.
 The assignment of t, where t is clamped to the range tmin through
 tmax, does not handle the case where bias < k < bias + tmin, but
 that is impossible because of the way bias is computed and because
 of the constraints on the parameters.

If the programming language does not provide overflow detection,
the following technique can be used. Suppose A, B, and C are
representable nonnegative integers and C is nonzero. Then A + B
overflows if and only if B > maxint  A, and A + (B * C) overflows
if and only if B > (maxint  A) div C, where maxint is the greatest
integer for which maxint + 1 cannot be represented. Refer to
appendix D "Punycode sample implementation" for demonstrations of
this technique in the C language. See also section 6.4 "Alternative
methods for handling overflow".
@@ 482,22 +518,22 @@
{if the input contains a nonbasic code point < n then fail}
while h < length(input) do begin
let m = the minimum {nonbasic} code point >= n in the input
let delta = delta + (m  n) * (h + 1), fail on overflow
let n = m
for each code point c in the input (in order) do begin
if c < n {or c is basic} then increment delta, fail on overflow
if c == n then begin
let q = delta
for k = base to infinity in steps of base do begin
 let t = tmin if k <= bias, tmax if k >= bias + tmax, or
 k  bias otherwise
+ let t = tmin if k <= bias {+ tmin}, or
+ tmax if k >= bias + tmax, or k  bias otherwise
if q < t then break
output the code point for digit t + ((q  t) mod (base  t))
let q = (q  t) div (base  t)
end
output the code point for digit q
let bias = adapt(delta, h + 1, test h equals b?)
let delta = 0
increment h
end
end
@@ 506,25 +543,31 @@
The full statement enclosed in braces (checking whether the input
contains a nonbasic code point less than n) can be omitted if all
code points less than initial_n are basic code points (which is true
for Punycode if code points are unsigned).
The braceenclosed conditions "nonbasic" and "or m is basic" can be
omitted if initial_n exceeds all basic code points (which is true
for Punycode), because the code point being tested is never less
than initial_n.
+ In the assignment of t, where t is clamped to the range tmin through
+ tmax, "+ tmin" can always be omitted. This makes the clamping
+ calculation incorrect when bias < k < bias + tmin, but that cannot
+ happen because of the way bias is computed and because of the
+ constraints on the parameters.
+
The checks for overflow are necessary to avoid producing invalid
output when the input contains very large values or is very long.
Wider integer variables can handle more extreme inputs. For IDNA,
26bit unsigned integers are sufficient, because any string that
 required a 27bit delta would have to exceed either the code point
+ needed a 27bit delta would have to exceed either the code point
limit (0..10FFFF) or the label length limit (63 characters).
The increment of delta at the bottom of the outer loop cannot
overflow because delta < length(input) before the increment, and
length(input) is already assumed to be representable. The increment
of n could overflow, but only if h == length(input), in which case
the procedure is finished anyway.
6.4 Alternative methods for handling overflow
@@ 537,37 +580,39 @@
if integer variables were capable of representing values that large.
This prevention approach would impose more restrictions on the input
than the detection approach does, but might be considered simpler in
some programming languages.
In theory, the decoder could use an analogous approach, limiting the
number of digits in a variablelength integer (that is, limiting the
number of iterations in the innermost loop). However, the number
of digits that suffice to represent a given delta can sometimes
represent much larger deltas (because of the adaptation), and hence
 this approach would probably require integers wider than 32 bits.
+ this approach would probably need integers wider than 32 bits.
Yet another approach for the decoder is to allow overflow to occur,
but to check the final output string by reencoding it and comparing
to the decoder input. If and only if they do not match (using a
caseinsensitive ASCII comparison) overflow has occurred. This
delayeddetection approach would not impose any more restrictions on
the input than the immediatedetection approach does, and might be
considered simpler in some programming languages.
In fact, if the decoder is used only inside the IDNA ToUnicode
operation [IDNA], then it need not check for overflow at all,
because ToUnicode performs a higher level reencoding and
comparison, and a mismatch has the same consequence as if the
Punycode decoder had failed.
7. Punycode example strings
+7. Punycode examples
+
+7.1 Sample strings
In the Punycode encodings below, the IDNA signature prefix is not
shown. Backslashes show where line breaks have been inserted in
strings too long for one line.
The first several examples are all translations of the sentence "Why
can't they just speak in ?" (courtesy of Michael Kaplan's
"provincial" page [PROVINCIAL]). Word breaks and punctuation have
been removed, as is often done in domain names.
@@ 670,56 +715,184 @@
Punycode: MajiKoi5783gue6qz075azm5e
(Q) de
u+30D1 u+30D5 u+30A3 u+30FC u+0064 u+0065 u+30EB u+30F3 u+30D0
Punycode: dejg4avhby1noc0d
(R)
u+305D u+306E u+30B9 u+30D4 u+30FC u+30C9 u+3067
Punycode: d9juau41awczczp
The last example is an ASCII string that breaks not only the
 existing rules for host name labels but also the rules proposed in
 [NAMEPREP] for internationalized domain names.
+ existing rules for host name labels but also the rules in [NAMEPREP]
+ for internationalized domain names.
(S) > $1.00 <
u+002D u+003E u+0020 u+0024 u+0031 u+002E u+0030 u+0030 u+0020
u+003C u+002D
Punycode: > $1.00 <
+7.2 Decoding traces
+
+ In the following traces, the evolving state of the decoder is
+ shown as a sequence of hexadecimal values, representing the code
+ points in the extended string. An asterisk appears just after the
+ most recently inserted code point, indicating both n (the value
+ preceeding the asterisk) and i (the position of the value just after
+ the asterisk). Other numerical values are decimal.
+
+ Decoding trace of example B from section 7.1:
+
+ n is 128, i is 0, bias is 72
+ input is "ihqwcrb4cv8a8dqg056pqjye"
+ there is no delimiter, so extended string starts empty
+ delta "ihq" decodes to 19853
+ bias becomes 21
+ 4E0D *
+ delta "wc" decodes to 64
+ bias becomes 20
+ 4E0D 4E2D *
+ delta "rb" decodes to 37
+ bias becomes 13
+ 4E3A * 4E0D 4E2D
+ delta "4c" decodes to 56
+ bias becomes 17
+ 4E3A 4E48 * 4E0D 4E2D
+ delta "v8a" decodes to 599
+ bias becomes 32
+ 4E3A 4EC0 * 4E48 4E0D 4E2D
+ delta "8d" decodes to 130
+ bias becomes 23
+ 4ED6 * 4E3A 4EC0 4E48 4E0D 4E2D
+ delta "qg" decodes to 154
+ bias becomes 25
+ 4ED6 4EEC * 4E3A 4EC0 4E48 4E0D 4E2D
+ delta "056p" decodes to 46301
+ bias becomes 84
+ 4ED6 4EEC 4E3A 4EC0 4E48 4E0D 4E2D 6587 *
+ delta "qjye" decodes to 88531
+ bias becomes 90
+ 4ED6 4EEC 4E3A 4EC0 4E48 4E0D 8BF4 * 4E2D 6587
+ Decoding trace of example L from section 7.1:
+
+ n is 128, i is 0, bias is 72
+ input is "3Bww4c5e180e575a65lsy2b"
+ literal portion is "3B", so extended string starts as:
+ 0033 0042
+ delta "ww4c" decodes to 62042
+ bias becomes 27
+ 0033 0042 5148 *
+ delta "5e" decodes to 139
+ bias becomes 24
+ 0033 0042 516B * 5148
+ delta "180e" decodes to 16683
+ bias becomes 67
+ 0033 5E74 * 0042 516B 5148
+ delta "575a" decodes to 34821
+ bias becomes 82
+ 0033 5E74 0042 516B 5148 751F *
+ delta "65l" decodes to 14592
+ bias becomes 67
+ 0033 5E74 0042 7D44 * 516B 5148 751F
+ delta "sy2b" decodes to 42088
+ bias becomes 84
+ 0033 5E74 0042 7D44 91D1 * 516B 5148 751F
+
+7.3 Encoding traces
+
+ In the following traces, code point values are hexadecimal, while
+ other numerical values are decimal.
+
+ Encoding trace of example B from section 7.1:
+
+ bias is 72
+ input is:
+ 4ED6 4EEC 4E3A 4EC0 4E48 4E0D 8BF4 4E2D 6587
+ there are no basic code points, so no literal portion
+ next code point to insert is 4E0D
+ needed delta is 19853, encodes as "ihq"
+ bias becomes 21
+ next code point to insert is 4E2D
+ needed delta is 64, encodes as "wc"
+ bias becomes 20
+ next code point to insert is 4E3A
+ needed delta is 37, encodes as "rb"
+ bias becomes 13
+ next code point to insert is 4E48
+ needed delta is 56, encodes as "4c"
+ bias becomes 17
+ next code point to insert is 4EC0
+ needed delta is 599, encodes as "v8a"
+ bias becomes 32
+ next code point to insert is 4ED6
+ needed delta is 130, encodes as "8d"
+ bias becomes 23
+ next code point to insert is 4EEC
+ needed delta is 154, encodes as "qg"
+ bias becomes 25
+ next code point to insert is 6587
+ needed delta is 46301, encodes as "056p"
+ bias becomes 84
+ next code point to insert is 8BF4
+ needed delta is 88531, encodes as "qjye"
+ bias becomes 90
+ output is "ihqwcrb4cv8a8dqg056pqjye"
+ Encoding trace of example L from section 7.1:
+
+ bias is 72
+ input is:
+ 0033 5E74 0042 7D44 91D1 516B 5148 751F
+ basic code points (0033, 0042) are copied to literal portion: "3B"
+ next code point to insert is 5148
+ needed delta is 62042, encodes as "ww4c"
+ bias becomes 27
+ next code point to insert is 516B
+ needed delta is 139, encodes as "5e"
+ bias becomes 24
+ next code point to insert is 5E74
+ needed delta is 16683, encodes as "180e"
+ bias becomes 67
+ next code point to insert is 751F
+ needed delta is 34821, encodes as "575a"
+ bias becomes 82
+ next code point to insert is 7D44
+ needed delta is 14592, encodes as "65l"
+ bias becomes 67
+ next code point to insert is 91D1
+ needed delta is 42088, encodes as "sy2b"
+ bias becomes 84
+ output is "3Bww4c5e180e575a65lsy2b"
+
8. Security considerations
Users expect each domain name in DNS to be controlled by a single
authority. If a Unicode string intended for use as a domain label
could map to multiple ACE labels, then an internationalized domain
name could map to multiple ASCII domain names, each controlled by
a different authority, some of which could be spoofs that hijack
service requests intended for another. Therefore Punycode is
designed so that each Unicode string has a unique encoding.
However, there can still be multiple Unicode representations of the
"same" text, for various definitions of "same". This problem is
addressed to some extent by the Unicode standard under the topic of
canonicalization, and this work is leveraged for domain names by
 "nameprep" [NAMEPREP].
+ Nameprep [NAMEPREP].
9. References
 [IDN] Internationalized Domain Names (IETF working group),
 http://www.idn.net/, idn@ops.ietf.org.

[IDNA] Patrik Faltstrom, Paul Hoffman, Adam M. Costello,
 "Internationalizing Host Names In Applications (IDNA)", 2001Nov19,
 draftietfidnidna05.
+ "Internationalizing Domain Names In Applications (IDNA)",
+ 2002#####, draftietfidnidna07.
 [NAMEPREP] Paul Hoffman, Marc Blanchet, "Stringprep
 Profile for Internationalized Host Names", 2001Sep27,
 draftietfidnnameprep06.
+ [NAMEPREP] Paul Hoffman, Marc Blanchet, "Nameprep: A Stringprep
+ Profile for Internationalized Domain Names", 2002#####,
+ draftietfidnnameprep08.
[PROVINCIAL] Michael Kaplan, "The 'anyone can be provincial!' page",
http://www.trigeminal.com/samples/provincial.html.
[RFC952] K. Harrenstien, M. Stahl, E. Feinler, "DOD Internet Host
Table Specification", 1985Oct, RFC 952.
[RFC1034] P. Mockapetris, "Domain Names  Concepts and Facilities",
1987Nov, RFC 1034.
@@ 729,69 +902,80 @@
A. Author contact information
Adam M. Costello
University of California, Berkeley
http://www.nicemice.net/amc/
B. Mixedcase annotation
In order to use Punycode to represent caseinsensitive strings,
higher layers need to casefold the strings prior to Punycode
 encoding. The encoded string can, however, use mixed case as an
 annotation telling how to convert the original folded string into a
 mixedcase string for display purposes.
+ encoding. The encoded string can use mixed case as an annotation
+ telling how to convert the folded string into a mixedcase string
+ for display purposes. Note, however, that mixedcase annotation
+ is not used by the ToASCII and ToUnicode operations specified in
+ [IDNA], and therefore implementors of IDNA can disregard this
+ appendix.
Basic code points can use mixed case directly, because the decoder
copies them verbatim, leaving lowercase code points lowercase, and
leaving uppercase code points uppercase. Each nonbasic code point
is represented by a delta, which is represented by a sequence of
basic code points, the last of which provides the annotation. If it
is uppercase, it is a suggestion to map the nonbasic code point to
uppercase (if possible); if it is lowercase, it is a suggestion to
map the nonbasic code point to lowercase (if possible).
 Punycode encoders and decoders are not required to support these
 annotations, and higher layers need not use them.
+ These annotations do not alter the code points returned by decoders;
+ the annotations are returned separately, for the caller to use or
+ ignore. Encoders can accept annotations in addition to code points,
+ but the annotations do not alter the output, except to influence the
+ uppercase/lowercase form of ASCII letters.
C. Disclaimer and License
+ Punycode encoders and decoders need not support these annotations,
+ and higher layers need not use them.
+
+C. Disclaimer and license
Regarding this entire document or any portion of it (including
the pseudocode and C code), the author makes no guarantees and
is not responsible for any damage resulting from its use. The
author grants irrevocable permission to anyone to use, modify,
and distribute it in any way that does not diminish the rights
of anyone else to use, modify, and distribute it, provided that
redistributed derivative works do not contain misleading author or
version information. Derivative works need not be licensed under
similar terms.
D. Punycode sample implementation
/*
punycode.c 0.4.0 (2001Nov17Sat)
http://www.cs.berkeley.edu/~amc/idn/
+punycode.c from draftietfidnpunycode01
+http://www.nicemice.net/idn/
Adam M. Costello
http://www.nicemice.net/amc/
*/
/* This is ANSI C code (C89) implementing Punycode version 0.3.x. */
+This is ANSI C code (C89) implementing
+Punycode (draftietfidnpunycode01).
+
+*/
/************************************************************/
/* Public interface (would normally go in its own .h file): */
#include
enum punycode_status {
punycode_success,
punycode_bad_input, /* Input is invalid. */
punycode_big_output, /* Output would exceed the space provided. */
 punycode_overflow /* Input requires wider integers to process. */
+ punycode_overflow /* Input needs wider integers to process. */
};
#if UINT_MAX >= (1 << 26)  1
typedef unsigned int punycode_uint;
#else
typedef unsigned long punycode_uint;
#endif
enum punycode_status punycode_encode(
punycode_uint input_length,
@@ 792,73 +976,72 @@
#else
typedef unsigned long punycode_uint;
#endif
enum punycode_status punycode_encode(
punycode_uint input_length,
const punycode_uint input[],
const unsigned char case_flags[],
punycode_uint *output_length,
char output[] );

/* punycode_encode() converts Unicode to Punycode. The input */
 /* must be represented as an array of Unicode code points (not */
 /* code units; surrogate pairs are not allowed), and the output */
+ /* is represented as an array of Unicode code points (not code */
+ /* units; surrogate pairs are not allowed), and the output */
/* will be represented as an array of ASCII code points. The */
/* output string is *not* nullterminated; it will contain */
/* zeros if and only if the input contains zeros. (Of course */
/* the caller can leave room for a terminator and add one if */
 /* needed.) The input_length is the number of code points in the */
 /* input. The output_length is an in/out argument: the caller */
 /* must pass in the maximum number of code points that may be */
 /* output, and on successful return it will contain the number */
 /* of code points actually output. The case_flags array must */
 /* hold input_length boolean values, where nonzero means the */
 /* corresponding Unicode character should be forced to uppercase */
 /* after being decoded (if possible), and zero means it should */
 /* be forced to lowercase (if possible). ASCII code points */
+ /* needed.) The input_length is the number of code points in */
+ /* the input. The output_length is an in/out argument: the */
+ /* caller passes in the maximum number of code points that it */
+ /* can receive, and on successful return it will contain the */
+ /* number of code points actually output. The case_flags array */
+ /* holds input_length boolean values, where nonzero suggests that */
+ /* the corresponding Unicode character be forced to uppercase */
+ /* after being decoded (if possible), and zero suggests that */
+ /* it be forced to lowercase (if possible). ASCII code points */
/* are encoded literally, except that ASCII letters are forced */
/* to uppercase or lowercase according to the corresponding */
/* uppercase flags. If case_flags is a null pointer then ASCII */
/* letters are left as they are, and other code points are */
/* treated as if their uppercase flags were zero. The return */
 /* value may be any of the punycode_status values defined above */
+ /* value can be any of the punycode_status values defined above */
/* except punycode_bad_input; if not punycode_success, then */
 /* output_size and output may contain garbage. */
+ /* output_size and output might contain garbage. */
enum punycode_status punycode_decode(
punycode_uint input_length,
const char input[],
punycode_uint *output_length,
punycode_uint output[],
unsigned char case_flags[] );
 /* punycode_decode() converts Punycode to Unicode. The input */
 /* must be represented as an array of ASCII code points, and */
 /* the output will be represented as an array of Unicode code */
 /* points. The input_length is the number of code points in */
 /* the input. The output_length is an in/out argument: the */
 /* caller must pass in the maximum number of code points that */
 /* may be output, and on successful return it will contain the */
 /* actual number of code points output. The case_flags array */
 /* must have room for at least output_length values, or it may */
 /* be a null pointer if the case information is not needed. */
 /* A nonzero flag indicates that the corresponding Unicode */
 /* character should be forced to uppercase by the caller (if */
 /* possible), while zero means it should be forced to lowercase */
 /* (if possible). ASCII code points are output already in the */
 /* proper case, but their flags will be set appropriately so that */
 /* applying the flags would be harmless. The return value may */
 /* be any of the punycode_status values defined above; if not */
 /* punycode_success, then output_length, output, and case_flags */
 /* may contain garbage. On success, the decoder will never need */
 /* to write an output_length greater than input_length, because */
 /* of how the encoding is defined. */
+
+ /* punycode_decode() converts Punycode to Unicode. The input is */
+ /* represented as an array of ASCII code points, and the output */
+ /* will be represented as an array of Unicode code points. The */
+ /* input_length is the number of code points in the input. The */
+ /* output_length is an in/out argument: the caller passes in */
+ /* the maximum number of code points that it can receive, and */
+ /* on successful return it will contain the actual number of */
+ /* code points output. The case_flags array needs room for at */
+ /* least output_length values, or it can be a null pointer if the */
+ /* case information is not needed. A nonzero flag suggests that */
+ /* the corresponding Unicode character be forced to uppercase */
+ /* by the caller (if possible), while zero suggests that it be */
+ /* forced to lowercase (if possible). ASCII code points are */
+ /* output already in the proper case, but their flags will be set */
+ /* appropriately so that applying the flags would be harmless. */
+ /* The return value can be any of the punycode_status values */
+ /* defined above; if not punycode_success, then output_length, */
+ /* output, and case_flags might contain garbage. On success, the */
+ /* decoder will never need to write an output_length greater than */
+ /* input_length, because of how the encoding is defined. */
/**********************************************************/
/* Implementation (would normally go in its own .c file): */
#include
/*** Bootstring parameters for Punycode ***/
enum { base = 36, tmin = 1, tmax = 26, skew = 38, damp = 700,
initial_bias = 72, initial_n = 0x80, delimiter = 0x2D };
@@ 873,22 +1056,22 @@
/* point (for use in representing integers) in the range 0 to */
/* base1, or base if cp is does not represent a value. */
static punycode_uint decode_digit(punycode_uint cp)
{
return cp  48 < 10 ? cp  22 : cp  65 < 26 ? cp  65 :
cp  97 < 26 ? cp  97 : base;
}
/* encode_digit(d,flag) returns the basic code point whose value */
/* (when used for representing integers) is d, which must be in the */
/* range 0 to base1. The lowercase form is used unless flag is */
+/* (when used for representing integers) is d, which needs to be in */
+/* the range 0 to base1. The lowercase form is used unless flag is */
/* nonzero, in which case the uppercase form is used. The behavior */
/* is undefined if flag is nonzero and digit d has no uppercase form. */
static char encode_digit(punycode_uint d, int flag)
{
return d + 22 + 75 * (d < 26)  ((flag != 0) << 5);
/* 0..25 map to ASCII a..z or A..Z */
/* 26..35 map to ASCII 0..9 */
}
@@ 993,21 +1177,22 @@
/* Punycode does not need to check whether input[j] is basic: */
if (input[j] < n /*  basic(input[j]) */ ) {
if (++delta == 0) return punycode_overflow;
}
if (input[j] == n) {
/* Represent delta as a generalized variablelength integer: */
for (q = delta, k = base; ; k += base) {
if (out >= max_out) return punycode_big_output;
 t = k <= bias ? tmin : k  bias >= tmax ? tmax : k  bias;
+ t = k <= bias /* + tmin */ ? tmin : /* +tmin not needed */
+ k >= bias + tmax ? tmax : k  bias;
if (q < t) break;
output[out++] = encode_digit(t + (q  t) % (base  t), 0);
q = (q  t) / (base  t);
}
output[out++] = encode_digit(q, case_flags && case_flags[j]);
bias = adapt(delta, h + 1, h == b);
delta = 0;
++h;
}
@@ 1063,26 +1249,26 @@
/* which gets added to i. The overflow checking is easier */
/* if we increase i as we go, then subtract off its starting */
/* value at the end to obtain delta. */
for (oldi = i, w = 1, k = base; ; k += base) {
if (in >= input_length) return punycode_bad_input;
digit = decode_digit(input[in++]);
if (digit >= base) return punycode_bad_input;
if (digit > (maxint  i) / w) return punycode_overflow;
i += digit * w;
 t = k <= bias ? tmin : k  bias >= tmax ? tmax : k  bias;
+ t = k <= bias /* + tmin */ ? tmin : /* +tmin not needed */
+ k >= bias + tmax ? tmax : k  bias;
if (digit < t) break;
if (w > maxint / (base  t)) return punycode_overflow;
w *= (base  t);
}

bias = adapt(i  oldi, out + 1, oldi == 0);
/* i was supposed to wrap around from out+1 to 0, */
/* incrementing n each time, so we'll fix that now: */
if (i / (out + 1) > maxint  n) return punycode_overflow;
n += i / (out + 1);
i %= (out + 1);
/* Insert n at position i of the output: */
@@ 1125,22 +1312,22 @@
{
fprintf(stderr,
"\n"
"%s e reads code points and writes a Punycode string.\n"
"%s d reads a Punycode string and writes code points.\n"
"\n"
"Input and output are plain text in the native character set.\n"
"Code points are in the form u+hex separated by whitespace.\n"
"Although the specification allows Punycode strings to contain\n"
"any characters from the ASCII repertoire, this test code\n"
 "supports only the printable characters, and requires the\n"
 "Punycode string to be followed by a newline.\n"
+ "supports only the printable characters, and needs the Punycode\n"
+ "string to be followed by a newline.\n"
"The case of the u in u+hex is the forcetouppercase flag.\n"
, argv[0], argv[0]);
exit(EXIT_FAILURE);
}
static void fail(const char *msg)
{
fputs(msg,stderr);
exit(EXIT_FAILURE);
}
@@ 1264,11 +1451,11 @@
if (r < 0) fail(io_error);
}
return EXIT_SUCCESS;
}
usage(argv);
return EXIT_SUCCESS; /* not reached, but quiets compiler warning */
}
 INTERNETDRAFT expires 2002Jul06
+ INTERNETDRAFT expires 2002Aug24