draft-ietf-idn-dude-01.txt   draft-ietf-idn-dude-02.txt 
Internet Engineering Task Force (IETF) Mark Welter INTERNET-DRAFT Mark Welter
INTERNET-DRAFT Brian W. Spolarich draft-ietf-idn-dude-02.txt Brian W. Spolarich
draft-ietf-idn-dude-01.txt WALID, Inc. Expires 2001-Dec-07 Adam M. Costello
March 02, 2001 Expires September 02, 2001 2001-Jun-07
DUDE: Differential Unicode Domain Encoding Differential Unicode Domain Encoding (DUDE)
Status of this memo Status of this Memo
This document is an Internet-Draft and is in full conformance with all This document is an Internet-Draft and is in full conformance with
provisions of Section 10 of RFC2026. all provisions of Section 10 of RFC2026.
Internet-Drafts are working documents of the Internet Engineering Task Internet-Drafts are working documents of the Internet Engineering
Force (IETF), its areas, and its working groups. Note that other Task Force (IETF), its areas, and its working groups. Note
groups may also distribute working documents as Internet-Drafts. that other groups may also distribute working documents as
Internet-Drafts.
Internet-Drafts are draft documents valid for a maximum of six months Internet-Drafts are draft documents valid for a maximum of six
and may be updated, replaced, or obsoleted by other documents at any months and may be updated, replaced, or obsoleted by other documents
time. It is inappropriate to use Internet-Drafts as reference at any time. It is inappropriate to use Internet-Drafts as
material or to cite them other than as "work in progress." reference material or to cite them other than as "work in progress."
The list of current Internet-Drafts can be accessed at The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt http://www.ietf.org/ietf/1id-abstracts.txt
The list of Internet-Draft Shadow Directories can be accessed at The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html. http://www.ietf.org/shadow.html
The distribution of this document is unlimited.
Copyright (c) The Internet Society (2000). All Rights Reserved. Distribution of this document is unlimited. Please send comments to
the authors or to the idn working group at idn@ops.ietf.org.
Abstract Abstract
This document describes a transformation method for representing DUDE is a reversible transformation from a sequence of nonnegative
Unicode character codepoints in host name parts in a fashion that is integer values to a sequence of letters, digits, and hyphens (LDH
completely compatible with the current Domain Name System. It provides characters). DUDE provides a simple and efficient ASCII-Compatible
for very efficient representation of typical Unicode sequences as Encoding (ACE) of Unicode strings [UNICODE] for use with
host name parts, while preserving simplicity. It is proposed as a Internationalized Domain Names [IDN] [IDNA].
potential candidate for an ASCII-Compatible Encoding (ACE) for supporting
the deployment of an internationalized Domain Name System.
Table of Contents Contents
1. Introduction 1. Introduction
1.1 Terminology 2. Terminology
2. Hostname Part Transformation 3. Overview
2.1 Post-Converted Name Prefix 4. Base-32 characters
2.2 Radix Selection 5. Encoding procedure
2.3 Hostname Preparation 6. Decoding procedure
2.4 Definitions 7. Example strings
2.5 DUDE Encoding 8. Security considerations
2.5.1 Extended Variable Length Hex Encoding 9. References
2.5.2 DUDE Compression Algorithm A. Acknowledgements
2.5.3 Forward Transformation Algorithm B. Author contact information
2.6 DUDE Decoding C. Mixed-case annotation
2.6.1 Extended Variable Length Hex Decoding D. Differences from draft-ietf-idn-dude-01
2.6.2 DUDE Decompression Algorithm E. Example implementation
2.6.3 Reverse Transformation Algorithm
3. Examples
4. Optional Case Preservation
5. Security Considerations
6. References
1. Introduction 1. Introduction
DUDE describes an encoding scheme of the ISO/IEC 10646 [ISO10646] The IDNA draft [IDNA] describes an architecture for supporting
character set (whose character code assignments are synchronized internationalized domain names. Each label of a domain name may
with Unicode [UNICODE3]), and the procedures for using this scheme begin with a special prefix, in which case the remainder of the
to transform host name parts containing Unicode character sequences label is an ASCII-Compatible Encoding (ACE) of a Unicode string
into sequences that are compatible with the current DNS protocol satisfying certain constraints. For the details of the constraints,
[STD13]. As such, it satisfies the definition of a 'charset' as see [IDNA] and [NAMEPREP]. The prefix has not yet been specified,
defined in [IDNREQ]. but see http://www.i-d-n.net/ for prefixes to be used for testing
and experimentation.
1.1 Terminology
The key words "MUST", "SHALL", "REQUIRED", "SHOULD", "RECOMMENDED", and
"MAY" in this document are to be interpreted as described in RFC 2119
[RFC2119].
Hexadecimal values are shown preceded with an "0x". For example,
"0xa1b5" indicates two octets, 0xa1 followed by 0xb5. Binary values are
shown preceded with an "0b". For example, a nine-bit value might be
shown as "0b101101111".
Examples in this document use the notation from the Unicode Standard
[UNICODE3] as well as the ISO 10646 names. For example, the letter "a"
may be represented as either "U+0061" or "LATIN SMALL LETTER A".
DUDE converts strings with internationalized characters into
strings of US-ASCII that are acceptable as host name parts in current
DNS host naming usage. The former are called "pre-converted" and the
latter are called "post-converted". This specification defines both
a forward and reverse transformation algorithm.
2. Hostname Part Transformation
According to [STD13], hostname parts must start and end with a letter
or digit, and contain only letters, digits, and the hyphen character
("-"). This, of course, excludes most characters used by non-English
speakers, as well as many other characters in the ASCII character
repertoire. Further, domain name parts must be 63 octets or
shorter in length.
2.1 Post-Converted Name Prefix
This document defines the string 'dq--' as a prefix to identify
DUDE-encoded sequences. For the purposes of comparison in the IDN
Working Group activities, the 'dq--' prefix should be used solely to
identify DUDE sequences. However, should this document proceed beyond
draft status the prefix should be changed to whatever prefix, if any,
is the final consensus of the IDN working group.
Note that the prepending of a fixed identifier sequence is only one
mechanism for differentiating ASCII character encoded international
domain names from 'ordinary' domain names. One method, as proposed in
[IDNRACE], is to include a character prefix or suffix that does not
appear in any name in any zone file. A second method is to insert a
domain component which pushes off any international names one or more
levels deeper into the DNS hierarchy. There are trade-offs between
these two methods which are independent of the Unicode to ASCII
transcoding method finally chosen. We do not address the international
vs. 'ordinary' name differentiation issue in this paper.
2.2 Radix Selection
There are many proposed methods for representing Unicode characters
within the allowed target character set, which can be split into groups
on the basis of the underlying radix. We have chosen a method with
radix 16 because both UTF-32 and ASCII are represented by even multiples
of four bits. This allows a Unicode character to be encoded as a
whole number of ASCII characters, and permits easier manipulation of
the resulting encoded data by humans.
2.3 Hostname Preparation
The hostname part is assumed to have at least one character disallowed
by [STD13], and to have been processed for logically equivalent
character mapping, filtering of disallowed characters (if any), and
compatibility composition/decomposition before presentation to the DUDE
conversion algorithm.
While it is possible to invent a transcoding mechanism that relies
on certain Unicode characters being deemed illegal within domain names
and hence available to the transcoding mechanism for improving encoding
efficiency, we feel that such a proposal would complicate matters
excessively.
2.4 Definitions
For clarity:
'integer' is an unsigned binary quantity;
'byte' is an 8-bit integer quantity;
'nibble' is a 4-bit integer quantity.
2.5 DUDE Encoding
The idea behind this scheme is to provide compression by encoding the
contiguous least significant nibbles of a character that differ from the
preceding character. The scheme uses a variant of the variable length
hex encoding described in [IDNDUERST] and elsewhere; because leading
zero nibbles are encoded rather than suppressed, the decoder can recover
the differential length. The encoding is, with some practice, easy to
perform manually.
2.5.1 Extended Variable Length Hex Encoding
The variable length hex encoding algorithm was introduced by Duerst in
[IDNDUERST]. It encodes an integer value in a slight modification of
traditional hexadecimal notation, the difference being that the most
significant digit is represented with an alternate set of "digits"
-- 'g' through 'v' are used to represent 0 through 15. The result is a
variable length encoding which can efficiently represent integers of
arbitrary length.
This specification extends the variable length hex encoding algorithm
to support the compression scheme defined below by potentially not
suppressing leading zero nibbles.
The extended variable length nibble encoding of an integer, C,
to length N, is defined as follows:
1. Start with I, the Nth nibble of C, counting from the least
significant nibble;
2. Emit the Ith character of the sequence [ghijklmnopqrstuv];
3. Continue from the most to least significant, encoding each
remaining nibble J by emitting the Jth character of the
sequence [0123456789abcdef].
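The three steps above can be sketched in C roughly as follows. This is only an illustrative sketch of the draft-01 encoding, not code from either draft: the function name vlhex_encode and its signature are hypothetical, N is assumed to be at least 1 and small enough that all N nibbles fit in an unsigned long, and the caller is assumed to supply a buffer of at least N + 1 characters.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper (name and signature are assumptions).
 * Writes the extended variable length hex encoding of the low n
 * nibbles of c into out, followed by a terminating null.
 * Leading zero nibbles are kept, which is the "extended" part. */
void vlhex_encode(unsigned long c, int n, char *out)
{
    static const char lead[] = "ghijklmnopqrstuv"; /* first nibble  */
    static const char rest[] = "0123456789abcdef"; /* other nibbles */
    int i;

    for (i = 0; i < n; ++i) {
        /* Walk from the Nth least significant nibble down to the
         * least significant one. */
        int shift = 4 * (n - 1 - i);
        unsigned nibble = (unsigned)((c >> shift) & 0xF);
        out[i] = (i == 0) ? lead[nibble] : rest[nibble];
    }
    out[n] = '\0';
}
```

For instance, encoding 0x5 to length 2 keeps the leading zero nibble and gives "g5", whereas encoding it to length 1 gives "l".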
2.5.2 DUDE Compression Algorithm
1. Let PREV = 0;
2. If there are no more characters in the input, terminate successfully;
4. Let C be the next character in the input;
5. If C != '-' , then go to step 7; DUDE is intended to be used as an ACE within IDNA, and has been
designed to have the following features:
6. Consume the input character, emit '-', and go to step 2; * Completeness: Every sequence of nonnegative integers maps to an
LDH string. Restrictions on which integers are allowed, and on
sequence length, may be imposed by higher layers.
7. Let D be the result of PREV exclusive ORed with C; * Uniqueness: Every sequence of nonnegative integers maps to at
most one LDH string.
8. Find the least positive value N such that * Reversibility: Any Unicode string mapped to an LDH string can
D bitwise ANDed with M is zero be recovered from that LDH string.
where M = the bitwise complement of (16**N) - 1;
9. Let V be C ANDed with the bitwise complement of M; * Efficient encoding: The ratio of encoded size to original size
is small. This is important in the context of domain names
because [RFC1034] restricts the length of a domain label to 63
characters.
10. Variable length hex encode V to length N and emit the result; * Simplicity: The encoding and decoding algorithms are reasonably
simple to implement. The goals of efficiency and simplicity are
at odds; DUDE places greater emphasis on simplicity.
11. Let PREV = C and go to step 2. An optional feature is described in appendix C "Mixed-case
annotation".
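As a rough illustration of the draft-01 compression steps (section 2.5.2, combined with the encoding of section 2.5.1), the following C sketch may help. It is not taken from either draft: the name dude01_compress, its signature, and its return convention are hypothetical. It follows the steps as written, so PREV is left unchanged when a hyphen is copied through, and it assumes 20-bit input values as stated in section 2.5.3.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch of the draft-01 compression algorithm.
 * Returns the number of characters written (not counting the
 * terminating null), or -1 on bad input or insufficient space. */
int dude01_compress(const unsigned long *in, size_t n_in,
                    char *out, size_t out_size)
{
    static const char lead[] = "ghijklmnopqrstuv";
    static const char rest[] = "0123456789abcdef";
    unsigned long prev = 0;                       /* step 1 */
    size_t o = 0, i;

    for (i = 0; i < n_in; ++i) {
        unsigned long c = in[i];
        if (c == 0x2D) {                          /* steps 5-6 */
            if (o + 1 >= out_size) return -1;
            out[o++] = '-';
        } else {
            unsigned long d = prev ^ c;           /* step 7 */
            int n = 1, k;
            if (c > 0xFFFFFUL) return -1;         /* 20-bit inputs only */
            /* Step 8: least positive n such that d has no bits set
             * above its low n nibbles. */
            while ((d >> (4 * n)) != 0) ++n;
            /* Steps 9-10: emit the low n nibbles of c, most
             * significant first, using the alternate digits for
             * the first one. */
            for (k = n - 1; k >= 0; --k) {
                unsigned nibble = (unsigned)((c >> (4 * k)) & 0xF);
                if (o + 1 >= out_size) return -1;
                out[o++] = (k == n - 1) ? lead[nibble] : rest[nibble];
            }
            prev = c;                             /* step 11 */
        }
    }
    if (o >= out_size) return -1;
    out[o] = '\0';
    return (int)o;
}
```

Applied to U+4E2D U+83EF U+8CA1 U+7D93, the sketch yields "ke2do3efsa1nd93", which matches the first label of example 3.7 once the 'dq--' prefix is removed.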
2.5.3 Forward Transformation Algorithm 2. Terminology
The DUDE transformation algorithm accepts a string in UTF-32 The key words "must", "shall", "required", "should", "recommended",
[UNICODE3] format as input. It is assumed that prior nameprep and "may" in this document are to be interpreted as described in
processing has disallowed the private use code points in RFC 2119 [RFC2119].
0x100000 through 0x10FFFF, so that we are left with the task of
encoding 20 bit integers. The encoding algorithm is as follows:
1. Break the hostname string into dot-separated hostname parts. LDH characters are the letters A-Z and a-z, the digits 0-9, and
For each hostname part which contains one or more characters hyphen-minus.
disallowed by [STD13], perform steps 2 and 3 below;
2. Compress the hostname part using the method described in section A quartet is a sequence of four bits (also known as a nibble or
2.5.2 above, and encode using the encoding described in section nybble).
2.5.1;
3. Prepend the post-converted name prefix 'dq--' (see section 2.1 A quintet is a sequence of five bits.
above) to the resulting string.
2.6 DUDE Decoding Hexadecimal values are shown preceded by "0x". For example, 0x60
is decimal 96.
2.6.1 Extended Variable Length Hex Decoding As in the Unicode Standard [UNICODE], Unicode code points are
denoted by "U+" followed by four to six hexadecimal digits, while a
range of code points is denoted by two hexadecimal numbers separated
by "..", with no prefixes.
Decoding extended variable length hex encoded strings is identical XOR means bitwise exclusive or. Given two nonnegative integer
to the standard variable length hex encoding, and is defined as values A and B, A XOR B is the nonnegative integer value whose
follows: binary representation is 1 in whichever places the binary
representations of A and B disagree, and 0 wherever they agree.
For the purpose of applying this rule, recall that an integer's
representation begins with an infinite number of unwritten zeros.
In some programming languages, care may need to be taken that A and
B are stored in variables of the same type and size.
1. Let CL be the lower case of the first input character, 3. Overview
If CL is not in set [ghijklmnopqrstuv], DUDE encodes a sequence of nonnegative integral values as a sequence
return error, of LDH characters, although implementations will of course need to
else represent the output characters somehow, typically as ASCII octets.
consume the input character; When DUDE is used to encode Unicode characters, the input values are
Unicode code points (integral values in the range 0..10FFFF, but not
D800..DFFF, which are reserved for use by UTF-16).
2. Let R = CL - 'g', Each value in the input sequence is represented by one or more LDH
Let N = 1; characters in the encoded string. The value 0x2D is represented
by hyphen-minus (U+002D). Each non-hyphen-minus character in
the encoded string represents a quintet. A sequence of quintets
represents the bitwise XOR between each non-0x2D integer and the
previous one.
3. If no more input characters exist, go to step 9. 4. Base-32 characters
4. Let CL be the lower case of the next input character; "a" = 0 = 0x00 = 00000 "s" = 16 = 0x10 = 10000
"b" = 1 = 0x01 = 00001 "t" = 17 = 0x11 = 10001
"c" = 2 = 0x02 = 00010 "u" = 18 = 0x12 = 10010
"d" = 3 = 0x03 = 00011 "v" = 19 = 0x13 = 10011
"e" = 4 = 0x04 = 00100 "w" = 20 = 0x14 = 10100
"f" = 5 = 0x05 = 00101 "x" = 21 = 0x15 = 10101
"g" = 6 = 0x06 = 00110 "y" = 22 = 0x16 = 10110
"h" = 7 = 0x07 = 00111 "z" = 23 = 0x17 = 10111
"i" = 8 = 0x08 = 01000 "2" = 24 = 0x18 = 11000
"j" = 9 = 0x09 = 01001 "3" = 25 = 0x19 = 11001
"k" = 10 = 0x0A = 01010 "4" = 26 = 0x1A = 11010
"m" = 11 = 0x0B = 01011 "5" = 27 = 0x1B = 11011
"n" = 12 = 0x0C = 01100 "6" = 28 = 0x1C = 11100
"p" = 13 = 0x0D = 01101 "7" = 29 = 0x1D = 11101
"q" = 14 = 0x0E = 01110 "8" = 30 = 0x1E = 11110
"r" = 15 = 0x0F = 01111 "9" = 31 = 0x1F = 11111
5. If CL is not in the set [0123456789abcdef], go to Step 9; The digits "0" and "1" and the letters "o" and "l" are not used, to
avoid transcription errors.
6. Consume the next input character, A decoder must accept both the uppercase and lowercase forms of
Let N = N + 1; the base-32 characters (including mixtures of both forms). An
Let R = R * 16; encoder should output only lowercase forms or only uppercase forms
(unless it uses the feature described in the appendix C "Mixed-case
annotation").
7. If CL is in set [0123456789], 5. Encoding procedure
then let R = R + (CL - '0')
else let R = R + (CL - 'a') + 10;
8. Go to step 3; All ordering of bits, quartets, and quintets is big-endian (most
significant first).
9. Let MASK be the bitwise complement of (16**N) - 1; let prev = 0x60
for each input integer n (in order) do begin
if n == 0x2D then output hyphen-minus
else begin
let diff = prev XOR n
represent diff in base 16 as a sequence of quartets,
as few as are sufficient (but at least one)
prepend 0 to the last quartet and 1 to each of the others
output a base-32 character corresponding to each quintet
let prev = n
end
end
10. Return decoded result R as well as MASK. If an encoder encounters an input value larger than expected (for
example, the largest Unicode code point is U+10FFFF, and nameprep
[NAMEPREP03] can never output a code point larger than U+EFFFD),
the encoder may either encode the value correctly, or may fail, but
it must not produce incorrect output. The encoder must fail if it
encounters a negative input value.
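The draft-02 encoding procedure (section 5) can be sketched in C along the following lines. This is a minimal illustration rather than the reference implementation of appendix E: the name dude_encode_sketch, its signature, and its return convention are hypothetical, and mixed-case annotation is not handled. Input values are unsigned, so the negative-input failure case cannot arise here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Base-32 characters from section 4: no "0", "1", "o", or "l". */
static const char dude_base32[] = "abcdefghijkmnpqrstuvwxyz23456789";

/* Hypothetical sketch of the section 5 encoding procedure.
 * Returns the number of characters written (not counting the
 * terminating null), or -1 if out_size is too small. */
int dude_encode_sketch(const unsigned long *in, size_t n_in,
                       char *out, size_t out_size)
{
    unsigned long prev = 0x60;
    size_t o = 0, i;

    for (i = 0; i < n_in; ++i) {
        if (in[i] == 0x2D) {
            /* Hyphen-minus is copied through; prev is unchanged. */
            if (o + 1 >= out_size) return -1;
            out[o++] = '-';
        } else {
            unsigned long diff = prev ^ in[i];
            int shift = 0;
            /* Locate the most significant nonzero quartet, so that
             * as few quartets as possible (at least one) are used. */
            while ((diff >> shift) > 0xF) shift += 4;
            for (; shift >= 0; shift -= 4) {
                unsigned quartet = (unsigned)((diff >> shift) & 0xF);
                /* Prepend 1 to every quartet except the last. */
                unsigned quintet = (shift > 0) ? (quartet | 0x10) : quartet;
                if (o + 1 >= out_size) return -1;
                out[o++] = dude_base32[quintet];
            }
            prev = in[i];
        }
    }
    if (o >= out_size) return -1;
    out[o] = '\0';
    return (int)o;
}
```

With the single input value 0x61, diff is 0x60 XOR 0x61 = 0x01, which is one quartet and therefore the single quintet 00001, giving the output "b".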
2.6.2 DUDE Decompression Algorithm 6. Decoding procedure
1. Let PREV = 0; let prev = 0x60
while the input string is not exhausted do begin
if the next character is hyphen-minus
then consume it and output 0x2D
else begin
consume characters and convert them to quintets until
encountering a quintet whose first bit is 0
fail upon encountering a non-base-32 character or end-of-input
strip the first bit of each quintet
concatenate the resulting quartets to form diff
let prev = prev XOR diff
output prev
end
end
encode the output sequence and compare it to the input string
fail if they do not match (case-insensitively)
2. If there are no more input characters then terminate successfully; The comparison at the end is necessary to guarantee the uniqueness
property (there cannot be two distinct encoded strings representing
the same sequence of integers). This check also frees the decoder
from having to check for overflow while decoding the base-32
characters. (If the decoder is one step of a larger decoding
process, it may be possible to defer the re-encoding and comparison
to the end of that larger decoding process.)
3. Let C be the next input character; 7. Example strings
4. If C == '-', append '-' to the result string, consume the character, The first several examples are nonsense strings of mostly unassigned
and go to step 2, code points intended to exercise the corner cases of the algorithm.
5. Let VPART, MASK be the next extended variable length hex decoded (A) u+0061
value and mask; DUDE: b
6. If VPART > 0xFFFFF then return error status, (B) u+2C7EF u+2C7EF
DUDE: u6z2ra
(C) u+1752B u+1752A
DUDE: tzxwmb
7. Let CU = ( PREV bitwise-AND MASK) + VPART, (D) u+63AB1 u+63ABA
Let PREV = CU; DUDE: yv47bm
8. Append the UTF-32 character CU to the result string; (E) u+261AF u+261BF
DUDE: uyt6rta
9. Go to step 2. (F) u+C3A31 u+C3A8C
DUDE: 6v4xb5p
2.6.3 Reverse Transformation Algorithm (G) u+09F44 u+0954C
DUDE: 39ue4si
1. Break the string into dot-separated components and apply Steps (H) u+8D1A3 u+8C8A3
2 through 4 to each component; DUDE: 27t6dt3sa
2. Remove the post converted name prefix 'dq--' (see Section 2.1); (I) u+6C2B6 u+CC266
DUDE: y6u7g4ss7a
3. Decompress the component using the decompression algorithm (J) u+002D u+002D u+002D u+E848F
described above (which in turn invokes the decoding algorithm DUDE: ---82w8r
also described above);
4. Concatenate the decoded segments with dot separators and return. (K) u+BD08E u+002D u+002D u+002D
DUDE: 57s8q---
3. Examples (L) u+A9A24 u+002D u+002D u+002D u+C05B7
DUDE: 434we---y393d
The examples below illustrate the encoding algorithm. Allowed RFC1035 (M) u+7FFFFFFF
characters, including period [U+002E] and dash [U+002D] are shown as DUDE: z999993r or explicit failure
literals in the UTF-16 version of the example. DUDE is compared to
LACE as proposed in [IDNLACE]. A comprehensive comparison of ACE
proposals is outside of the scope of this document. However we believe
that DUDE shows a good balance between efficiency (resulting in shorter
ACE sequences for typical names) and complexity.
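The draft-02 decoding procedure (section 6), including the final re-encode-and-compare check, can be sketched in C as below. This is an illustrative sketch only, not the appendix E implementation: the names dude_decode_sketch, encode_for_check, and equal_ignoring_case, together with their signatures, are hypothetical. The caller is assumed to supply scratch space at least as large as the input string including its terminating null.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

static const char b32[] = "abcdefghijkmnpqrstuvwxyz23456789";

/* Compact encoder (section 5), used only for the final check. */
static int encode_for_check(const unsigned long *in, size_t n,
                            char *out, size_t size)
{
    unsigned long prev = 0x60;
    size_t o = 0, i;
    for (i = 0; i < n; ++i) {
        if (in[i] == 0x2D) {
            if (o + 1 >= size) return -1;
            out[o++] = '-';
        } else {
            unsigned long diff = prev ^ in[i];
            int shift = 0;
            while ((diff >> shift) > 0xF) shift += 4;
            for (; shift >= 0; shift -= 4) {
                if (o + 1 >= size) return -1;
                out[o++] = b32[((diff >> shift) & 0xF) | (shift > 0 ? 0x10 : 0)];
            }
            prev = in[i];
        }
    }
    if (o >= size) return -1;
    out[o] = '\0';
    return (int)o;
}

static int equal_ignoring_case(const char *a, const char *b)
{
    for (; *a != '\0' && *b != '\0'; ++a, ++b)
        if (tolower((unsigned char)*a) != tolower((unsigned char)*b))
            return 0;
    return *a == *b;
}

/* Returns the number of values decoded, or -1 on failure. */
int dude_decode_sketch(const char *in, unsigned long *out, size_t max_out,
                       char *scratch, size_t scratch_size)
{
    unsigned long prev = 0x60;
    const char *s = in;
    size_t n = 0;

    while (*s != '\0') {
        if (n >= max_out) return -1;
        if (*s == '-') {
            ++s;
            out[n++] = 0x2D;
        } else {
            unsigned long diff = 0;
            unsigned quintet;
            do {
                const char *p;
                int c = tolower((unsigned char)*s);
                if (c == '\0') return -1;      /* input ended mid-sequence */
                p = strchr(b32, c);
                if (p == NULL) return -1;      /* not a base-32 character */
                quintet = (unsigned)(p - b32);
                diff = (diff << 4) | (quintet & 0xF);  /* strip first bit */
                ++s;
            } while (quintet & 0x10);          /* first bit 0 ends the run */
            prev ^= diff;
            out[n++] = prev;
        }
    }
    /* Re-encode and compare, so that non-canonical input is rejected. */
    if (encode_for_check(out, n, scratch, scratch_size) < 0) return -1;
    if (!equal_ignoring_case(scratch, in)) return -1;
    return (int)n;
}
```

For example, "sb" decodes to the same value as "b" (0x61) before the check, but it re-encodes as "b", so the comparison fails and the sketch rejects it.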
3.1 'www.walid.com' [Arabic]: The next several examples are realistic Unicode strings that could
be used in domain names. They exhibit single-row text, two-row
text, ideographic text, and mixtures thereof. These examples are
names of Japanese television programs, music artists, and songs,
merely because one of the authors happened to have them handy.
UTF-16: U+0645 U+0648 U+0642 U+0639 . U+0648 U+0644 U+064A U+062F . (N) 3<nen>b<gumi><kinpachi><sensei> (Latin, kanji)
U+0634 U+0631 U+0643 U+0629 u+0033 u+5E74 u+0062 u+7D44 u+91D1 u+516B u+5148 u+751F
DUDE: xdx8whx8tgz7ug863f6s5kuduwxh
DUDE: dq--m45oij9.dq--m48kqif.dq--m34hk3i9 (O) <amuro><namie>-with-super-monkeys (Latin, kanji, hyphens)
u+5B89 u+5BA4 u+5948 u+7F8E u+6075 u+002D u+0077 u+0069 u+0074
u+0068 u+002D u+0073 u+0075 u+0070 u+0065 u+0072 u+002D u+006D
u+006F u+006E u+006B u+0065 u+0079 u+0073
DUDE: x58jupu8nuy6gt99m-yssctqtptn-tmgftfth-trcbfqtnk
LACE: bq--aqdekscche.bq--aqdeqrckf5.bq--aqddimkdfe (P) maji<de>koi<suru>5<byou><mae> (Latin, hiragana, kanji)
u+006D u+0061 u+006A u+0069 u+3067 u+006B u+006F u+0069 u+3059
u+308B u+0035 u+79D2 u+524D
DUDE: pnmdvssqvssnegvsva7cvs5qz38hu53r
3.2 'Abugazalah-Intellectual-Property.com' [Arabic]: (Q) <pafii>de<runba> (Latin, katakana)
u+30D1 u+30D5 u+30A3 u+30FC u+0064 u+0065 u+30EB u+30F3 u+30D0
DUDE: vs5bezgxrvs3ibvs2qtiud
(R) <sono><supiido><de> (hiragana, katakana)
u+305D u+306E u+30B9 u+30D4 u+30FC u+30C9 u+3067
DUDE: vsvpvd7hypuivf4q
UTF-16: U+0623 U+0628 U+0648 U+063A U+0632 U+0627 U+0644 U+0629 - 8. Security considerations
U+0644 U+0644 U+0645 U+0644 U+0643 U+064A U+0629 - U+0627
U+0644 U+0641 U+0643 U+0631 U+064A U+0629 . U+0634 U+0631
U+0643 U+0629
DUDE: dq--m23ok8jaii7k4i9-m44klkjqi9-m27k4hjj1kai9.dq--m34hk3i9 Users expect each domain name in DNS to be controlled by a single
authority. If a Unicode string intended for use as a domain label
could map to multiple ACE labels, then an internationalized domain
name could map to multiple ACE domain names, each controlled by
a different authority, some of which could be spoofs that hijack
service requests intended for another. Therefore DUDE is designed
so that each Unicode string has a unique encoding.
LACE: bq--badcgkcihizcorbjaeac2bygircekrcdjiuqcabna4dcorcbimyuuki. However, there can still be multiple Unicode representations of the
bq--aqddimkdfe "same" text, for various definitions of "same". This problem is
addressed to some extent by the Unicode standard under the topic of
canonicalization, and this work is leveraged for domain names by
"nameprep" [NAMEPREP03].
3.3 'King-Hussain.person.jr' [Arabic] 9. References
UTF-16: U+0627 U+0644 U+0645 U+0644 U+0643 - U+062D U+0633 U+064A [IDN] Internationalized Domain Names (IETF working group),
U+0646 . U+0634 U+062E U+0635 . U+0627 U+0644 U+0623 U+0631 http://www.i-d-n.net/, idn@ops.ietf.org.
U+062F U+0646
DUDE: dq--m27k4lkj-m2dj3kam.dq--m34iej5.dq--m27k4i3j1ifk6 [IDNA] Patrik Faltstrom, Paul Hoffman, "Internationalizing Host
Names In Applications (IDNA)", draft-ietf-idn-idna-01.
LACE: bq--audcorcfirbqcabnaudegljtjjda.bq--amddilrv. [NAMEPREP03] Paul Hoffman, Marc Blanchet, "Preparation
bq--aydcorbdgexum of Internationalized Host Names", 2001-Feb-24,
draft-ietf-idn-nameprep-03.
3.4 'Jordanian-Dental-Center.com.jr' [Arabic] [RFC952] K. Harrenstien, M. Stahl, E. Feinler, "DOD Internet Host
Table Specification", 1985-Oct, RFC 952.
UTF-16: U+0645 U+0631 U+0643 U+0632 - U+0627 U+0644 U+0623 U+0631 U+062F [RFC1034] P. Mockapetris, "Domain Names - Concepts and Facilities",
U+0646 - U+0644 U+0644 U+0623 U+0633 U+0646 U+0627 U+0646 . 1987-Nov, RFC 1034.
U+0634 U+0631 U+0643 U+0629 . U+0627 U+0644 U+0623 U+0631 U+062F
U+0646
DUDE: dq--m45j1k3j2-m27k4i3j1ifk6-m44ki3j3k6i7k6.dq--m34hk3i9. [RFC1123] Internet Engineering Task Force, R. Braden (editor),
dq--m27k4i3j1ifk6 "Requirements for Internet Hosts -- Application and Support",
1989-Oct, RFC 1123.
LACE: bq--aqdekmkdgiaqaligaytuiizrf5dacabna4deirbdgndcorq. [RFC2119] Scott Bradner, "Key words for use in RFCs to Indicate
bq--aqddimkdfe.bq--aydcorbdgexum Requirement Levels", 1997-Mar, RFC 2119.
3.5 'Mahindra.com' [Hindi]: [SFS] David Mazieres et al, "Self-certifying File System",
http://www.fs.net/.
UTF-16: U+092E U+0939 U+093F U+0928 U+094D U+0926 U+094D U+0930 [UNICODE] The Unicode Consortium, "The Unicode Standard",
U+093E . U+0935 U+094D U+092F U+093E U+092A U+093E U+0930 http://www.unicode.org/unicode/standard/standard.html.
DUDE: dq--p2ej9vi8kdi6kdj0u.dq--p35kdifjeiajeg A. Acknowledgements
LACE: bq--bees4oj7fbgsmtjqhy.bq--a4etktjphyvd4ma The basic encoding of integers to quartets to quintets to base-32
comes from earlier IETF work by Martin Duerst. DUDE uses a slight
variation on the idea.
3.6 'Webdunia.com' [Hindi]: Paul Hoffman provided helpful comments on this document.
UTF-16: U+0935 U+0947 U+092C U+0926 U+0941 U+0928 U+093F U+092F The idea of avoiding 0, 1, o, and l in base-32 strings was taken
U+093E . U+0935 U+094D U+092F U+093E U+092A U+093E U+0930 from SFS [SFS].
DUDE: dq--p35k7icmk1i8jfifje.dq--p35kdifjeiajeg B. Author contact information
LACE: bq--beetkrzmezasqpzphy.bq--a4etktjphyvd4ma Mark Welter <mwelter@walid.com>
Brian W. Spolarich <briansp@walid.com>
WALID, Inc.
State Technology Park
2245 S. State St.
Ann Arbor, MI 48104
+1 734 822 2020
3.7 'Chinese Finance.com' [Traditional Chinese] Adam M. Costello <amc@cs.berkeley.edu>
University of California, Berkeley
http://www.cs.berkeley.edu/~amc/
UTF-16: U+4E2D U+83EF U+8CA1 U+7D93 . c o m C. Mixed-case annotation
DUDE: dq--ke2do3efsa1nd93.com In order to use DUDE to represent case-insensitive Unicode strings,
higher layers need to case-fold the Unicode strings prior to DUDE
encoding. The encoded string can, however, use mixed-case base-32
(rather than all-lowercase or all-uppercase as recommended in
section 4 "Base-32 characters") as an annotation telling how to
convert the folded Unicode string into a mixed-case Unicode string
for display purposes.
LACE: bq--75hc3a7prsqx3ey.com Each Unicode code point (unless it is U+002D hyphen-minus) is
represented by a sequence of base-32 characters, the last of which
is always a letter (as opposed to a digit). If that letter is
uppercase, it is a suggestion that the Unicode character be mapped
to uppercase (if possible); if the letter is lowercase, it is a
suggestion that the Unicode character be mapped to lowercase (if
possible).
3.8 'Chinese Readers.net' [Chinese] DUDE encoders and decoders are not required to support these
annotations, and higher layers need not use them.
UTF-16: U+842C U+7DAD U+8B80 U+8005 . U+7DB2 U+7D61 Example: In order to suggest that example O in section 7 "Example
strings" be displayed as:
DUDE: dq--o42cndadob80g05.dq--ndb2m1 <amuro><namie>-with-SUPER-MONKEYS
LACE: bq--76ccy7nnroaiabi.bq--aj63eyi one could capitalize the DUDE encoding as:
3.9 'Russian-Standard.com.ru' [Russian] x58jupu8nuy6gt99m-yssctqtptn-tMGFtFtH-tRCBFQtNK
UTF-16: U+0440 U+0443 U+0441 U+0441 U+043A U+0438 U+0439 - D. Differences from draft-ietf-idn-dude-01
U+0441 U+0442 U+0430 U+043D U+0434 U+0430 U+0440 U+0442 .
U+043A U+043E U+043C . U+0440 U+0444
DUDE: dq--k40jhhjaop-k3ausk1ij0tkgk0i.dq--k3aus.dq--k40k Four changes have been made since draft-ietf-idn-dude-01 (DUDE-01):
LACE: bq--a4ceaq2bie5dqoibaawqqbcbiiyd2nbqibba.bq--amcdupr4. 1) DUDE-01 computed the XOR of each integer with the previous one
bq--aiceara in order to decide how many bits of each integer to encode, but
now the XOR itself is encoded, so there is no need for a mask.
3.10 'Vladimir-Putin.person.ru' [Russian] 2) DUDE-01 made the first quintet of each sequence different from
the rest, while now it is the last quintet that differs, so it's
easier for the decoder to detect the end of the sequence.
UTF-16: U+0432 U+043B U+0430 U+0434 U+0438 U+043C U+0438 U+0440 - 3) The base-32 map has changed to avoid 0, 1, o, and l, to help
U+043F U+0443 U+0442 U+0438 U+043D . U+043B U+0438 U+0447 humans avoid transcription errors.
U+043D U+043E U+0441 U+0442 U+044C . U+0440 U+0444 U+0020
DUDE: dq--k32rgkosok0-k3fk3ij8t.dq--k3bok7jduk1is.dq--k40k 4) The initial value of the previous code point has changed from 0
to 0x60, making the encodings of a few domain names shorter and
none longer.
LACE: bq--bacdeozqgq4dyocaaeac2bieh5bueob5. E. Example implementation
bq--bacdwochhu7ecqsm.bq--aiceara
4. Optional Case Preservation /******************************************/
/* dude.c 0.2.3 (2001-May-31-Thu) */
/* Adam M. Costello <amc@cs.berkeley.edu> */
/******************************************/
An extension to the DUDE concept recognizes that the first /* This is ANSI C code (C89) implementing */
character emitted by the variable length hex encoding algorithm is /* DUDE (draft-ietf-idn-dude-02). */
always alphabetic. We encode the case (if any) of the original Unicode
character in the case of the initial "hex" character. Because the DNS
performs case-insensitive comparisons, mixed case international domain
names behave in exactly the same way as traditional domain names.
In particular, this enables reverse lookups to return names in the
preferred case.
In contrast to other proposals as of this writing, such a case preserving /************************************************************/
version of DUDE will interoperate with the non case preserving version. /* Public interface (would normally go in its own .h file): */
Despite the foregoing, we feel that the additional complexity of tracking #include <limits.h>
character case through the nameprep processing is not warranted by the
marginal utility of the result.
5. Security Considerations enum dude_status {
dude_success,
dude_bad_input,
dude_big_output /* Output would exceed the space provided. */
};
Much of the security of the Internet relies on the DNS and any enum case_sensitivity { case_sensitive, case_insensitive };
change to the characteristics of the DNS may change the security of
much of the Internet. Therefore DUDE makes no changes to the DNS itself.
DUDE is designed so that distinct Unicode sequences map to distinct #if UINT_MAX >= 0x1FFFFF
domain name sequences (modulo the Unicode and DNS equivalence rules). typedef unsigned int u_code_point;
Therefore use of DUDE with DNS will not negatively affect security below #else
the application level. typedef unsigned long u_code_point;
#endif
If an application has security reliance on the Unicode string S, produced enum dude_status dude_encode(
by an inverse ACE transformation of a name T, the application must verify unsigned int input_length,
that the nameprepped and ACE encoded result of S is DNS-equivalent to T. const u_code_point input[],
const unsigned char uppercase_flags[],
unsigned int *output_size,
char output[] );
/* dude_encode() converts Unicode to DUDE (without any */
/* signature). The input must be represented as an array */
/* of Unicode code points (not code units; surrogate pairs */
/* are not allowed), and the output will be represented as */
/* null-terminated ASCII. The input_length is the number of code */
/* points in the input. The output_size is an in/out argument: */
/* the caller must pass in the maximum number of characters */
/* that may be output (including the terminating null), and on */
/* successful return it will contain the number of characters */
/* actually output (including the terminating null, so it will be */
/* one more than strlen() would return, which is why it is called */
/* output_size rather than output_length). The uppercase_flags */
/* array must hold input_length boolean values, where nonzero */
/* means the corresponding Unicode character should be forced */
/* to uppercase after being decoded, and zero means it is */
/* caseless or should be forced to lowercase. Alternatively, */
/* uppercase_flags may be a null pointer, which is equivalent */
/* to all zeros. The encoder always outputs lowercase base-32 */
/* characters except when nonzero values of uppercase_flags */
/* require otherwise. The return value may be any of the */
/* dude_status values defined above; if not dude_success, then */
/* output_size and output may contain garbage. On success, the */
/* encoder will never need to write an output_size greater than */
/* input_length*k+1 if all the input code points are less than 1 */
/* << (4*k) and k >= 2 (the initial reference value 0x60 itself */
/* needs two base-32 characters), because of how the encoding is */
/* defined. */
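
The size bound stated in the comment above can be turned into a small
helper for sizing the output buffer before calling dude_encode(). The
following is only a sketch: the name sketch_max_output_size is
hypothetical and not part of the draft, and it assumes the bound should
never use fewer than two base-32 characters per code point, because the
first difference is taken against the initial value 0x60.

```c
#include <stddef.h>

/* Hypothetical helper (not part of the draft): returns a buffer size,  */
/* including the terminating null, that is large enough for encoding    */
/* the given code points.                                               */
static unsigned int sketch_max_output_size(const unsigned long input[],
                                           unsigned int input_length)
{
  /* Start from the initial reference value 0x60 so that k >= 2. */
  unsigned long max_cp = 0x60, tmp;
  unsigned int i, k;

  for (i = 0; i < input_length; ++i) {
    if (input[i] > max_cp) max_cp = input[i];
  }

  /* k = number of 4-bit groups needed for the largest value. */
  for (tmp = max_cp >> 4, k = 1; tmp != 0; ++k, tmp >>= 4) ;

  return input_length * k + 1;
}
```

A caller could allocate this many characters and pass the same value as
*output_size; the bound is conservative, so the encoder will usually
report a smaller size on return.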
6. Change History

The statement that we intended to submit a Nameprep draft was removed in
light of the changes made between the first and second nameprep drafts.

The details of DUDE extensions for case preservation etc. have been
removed. Basic DUDE was changed to operate over the relevant 20 bit
UTF32 code points.

Examples have been extended.

ACE security issues were clarified.

enum dude_status dude_decode(
  enum case_sensitivity case_sensitivity,
  char scratch_space[],
  const char input[],
  unsigned int *output_length,
  u_code_point output[],
  unsigned char uppercase_flags[] );

/* dude_decode() converts DUDE (without any signature) to */
/* Unicode. The input must be represented as null-terminated */
/* ASCII, and the output will be represented as an array of */
/* Unicode code points. The case_sensitivity argument influences */
/* the check on the well-formedness of the input string; it */
/* must be case_sensitive if case-sensitive comparisons are */
/* allowed on encoded strings, case_insensitive otherwise. */
/* The scratch_space must point to space at least as large */
/* as the input, which will get overwritten (this allows the */
/* decoder to avoid calling malloc()). The output_length is */
/* an in/out argument: the caller must pass in the maximum */
/* number of code points that may be output, and on successful */
/* return it will contain the actual number of code points */
/* output. The uppercase_flags array must have room for at */
/* least output_length values, or it may be a null pointer if */
/* the case information is not needed. A nonzero flag indicates */
/* that the corresponding Unicode character should be forced to */
/* uppercase by the caller, while zero means it is caseless or */
/* should be forced to lowercase. The return value may be any */
/* of the dude_status values defined above; if not dude_success, */
/* then output_length, output, and uppercase_flags may contain */
/* garbage. On success, the decoder will never need to write */
/* an output_length greater than the length of the input (not */
/* counting the null terminator), because of how the encoding is */
/* defined. */

/**********************************************************/
/* Implementation (would normally go in its own .c file): */

#include <string.h>

/* Character utilities: */

/* base32[q] is the lowercase base-32 character representing */
/* the number q from the range 0 to 31. Note that we cannot */
/* use string literals for ASCII characters because an ANSI C */
/* compiler does not necessarily use ASCII. */

7. References

[IDNCOMP] Paul Hoffman, "Comparison of Internationalized Domain Name
Proposals", draft-ietf-idn-compare;

[IDNrACE] Paul Hoffman, "RACE: Row-Based ASCII Compatible Encoding for
IDN", draft-ietf-idn-race;

[IDNLACE] Mark Davis, "LACE: Length-Based ASCII Compatible Encoding for
IDN", draft-ietf-idn-lace;

[IDNREQ] James Seng, "Requirements of Internationalized Domain Names",
draft-ietf-idn-requirement;

[IDNNAMEPREP] Paul Hoffman and Marc Blanchet, "Preparation of
Internationalized Host Names", draft-ietf-idn-nameprep;

[IDNDUERST] M. Duerst, "Internationalization of Domain Names",
draft-duerst-dns-i18n;

[ISO10646] ISO/IEC 10646-1:1993. International Standard -- Information
technology -- Universal Multiple-Octet Coded Character Set (UCS) --
Part 1: Architecture and Basic Multilingual Plane. Five amendments and
a technical corrigendum have been published up to now. UTF-16 is
described in Annex Q, published as Amendment 1. 17 other amendments are
currently at various stages of standardization;

[RFC2119] Scott Bradner, "Key words for use in RFCs to Indicate
Requirement Levels", March 1997, RFC 2119;

[STD13] Paul Mockapetris, "Domain names - implementation and
specification", November 1987, STD 13 (RFC 1035);

[UNICODE3] The Unicode Consortium, "The Unicode Standard -- Version
3.0", ISBN 0-201-61633-5. Described at
<http://www.unicode.org/unicode/standard/versions/Unicode3.0.html>.

A. Acknowledgements

The structure (and some of the structural text) of this document is
intentionally borrowed from the LACE IDN draft (draft-ietf-idn-lace-00)
by Mark Davis and Paul Hoffman.

B. IANA Considerations

There are no IANA considerations in this document.

C. Author Contact Information

Mark Welter
Brian W. Spolarich
WALID, Inc.
State Technology Park
2245 S. State St.
Ann Arbor, MI 48104
+1-734-822-2020
mwelter@walid.com
briansp@walid.com

D. DUDE C++ Implementation

#include <stdio.h>
#include <string.h>
#include <ctype.h>
#include <limits.h>

#define IDN_ERROR INT_MIN

#define DUDETAG "dq--"

typedef unsigned int uchar_t;

static const char base32[] = {
  97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107,     /* a-k */
  109, 110,                                               /* m-n */
  112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122,  /* p-z */
  50, 51, 52, 53, 54, 55, 56, 57                          /* 2-9 */
};

/* base32_decode(c) returns the value of a base-32 character, in the */
/* range 0 to 31, or the constant base32_invalid if c is not a valid */
/* base-32 character. */

enum { base32_invalid = 32 };

static unsigned int base32_decode(char c)
{
  if (c < 50) return base32_invalid;
  if (c <= 57) return c - 26;
  if (c < 97) c += 32;
  if (c < 97 || c == 108 || c == 111 || c > 122) return base32_invalid;
  return c - 97 - (c > 108) - (c > 111);
}

/* unequal(case_sensitivity,s1,s2) returns 0 if the strings s1 and s2 */
/* are equal, 1 otherwise. If case_sensitivity is case_insensitive, */
/* then ASCII A-Z are considered equal to a-z respectively. */

static int unequal( enum case_sensitivity case_sensitivity,
                    const char s1[], const char s2[] )
{
  char c1, c2;

  if (case_sensitivity != case_insensitive) return strcmp(s1,s2) != 0;

  for (;;) {
    c1 = *s1;
    c2 = *s2;
    if (c1 >= 65 && c1 <= 90) c1 += 32;
    if (c2 >= 65 && c2 <= 90) c2 += 32;
    if (c1 != c2) return 1;
    if (c1 == 0) return 0;
    ++s1, ++s2;
  }
}

/* Encoder: */

enum dude_status dude_encode(
  unsigned int input_length,
  const u_code_point input[],
  const unsigned char uppercase_flags[],
  unsigned int *output_size,
  char output[] )
{
  unsigned int max_out, in, out, k, j;
  u_code_point prev, codept, diff, tmp;
  char shift;

  prev = 0x60;
  max_out = *output_size;

  for (in = out = 0; in < input_length; ++in) {
    /* At the start of each iteration, in and out are the number of */
    /* items already input/output, or equivalently, the indices of */
    /* the next items to be input/output. */

    codept = input[in];

    if (codept == 0x2D) {
      /* Hyphen-minus stands for itself. */
      if (max_out - out < 1) return dude_big_output;
      output[out++] = 0x2D;
      continue;
    }

    diff = prev ^ codept;

    /* Compute the number of base-32 characters (k): */
    for (tmp = diff >> 4, k = 1; tmp != 0; ++k, tmp >>= 4);

    if (max_out - out < k) return dude_big_output;
    shift = uppercase_flags && uppercase_flags[in] ? 32 : 0;
    /* shift controls the case of the last base-32 digit. */

    /* Each quintet has the form 1xxxx except the last is 0xxxx. */
    /* Computing the base-32 digits in reverse order is easiest. */

    out += k;
    output[out - 1] = base32[diff & 0xF] - shift;

    for (j = 2; j <= k; ++j) {
      diff >>= 4;
      output[out - j] = base32[0x10 | (diff & 0xF)];
    }

    prev = codept;
  }

  /* Append the null terminator: */
  if (max_out - out < 1) return dude_big_output;
  output[out++] = 0;

  *output_size = out;
  return dude_success;
}
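
To make the quintet layout used by the encoder concrete, the sketch
below encodes a single XOR difference on its own. It is an
illustration, not a replacement for dude_encode(): the names
sketch_alphabet and sketch_encode_diff are hypothetical, the uppercase
flag is ignored, and a string literal is used for the alphabet, so the
sketch assumes an ASCII execution character set (which the listing
above deliberately does not).

```c
#include <stddef.h>

/* Hypothetical copy of the base-32 alphabet: a-k, m-n, p-z, 2-9. */
static const char sketch_alphabet[] = "abcdefghijkmnpqrstuvwxyz23456789";

/* Writes the base-32 characters for diff into out (null-terminated). */
/* Returns the number of characters written, not counting the null,   */
/* or 0 if cap is too small to hold them.                             */
static size_t sketch_encode_diff(unsigned long diff, char out[], size_t cap)
{
  unsigned long tmp;
  size_t k, j;

  /* k = number of 4-bit groups in diff (at least one). */
  for (tmp = diff >> 4, k = 1; tmp != 0; ++k, tmp >>= 4) ;
  if (cap < k + 1) return 0;

  out[k] = 0;
  out[k - 1] = sketch_alphabet[diff & 0xF];            /* last: 0xxxx */
  for (j = 2; j <= k; ++j) {
    diff >>= 4;
    out[k - j] = sketch_alphabet[0x10 | (diff & 0xF)]; /* others: 1xxxx */
  }
  return k;
}
```

For example, with the initial reference value 0x60, the code point
U+00FC gives the difference 0x9C, whose two groups 9 and C become the
quintets 11001 and 01100, that is, the characters "3n".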
bool idn_isRFC1035(const uchar_t * in, int len)
{
    const uchar_t * end = in + len;
    while (in < end)
    {
        if ((*in > 127) ||
            !strchr("abcdefghijklmnopqrstuvwxyz0123456789-.", tolower(*in)))
            return false;
        in++;
    }
    return true;
}

static const char *hexchar = "0123456789abcdef";
static const char *leadchar = "ghijklmnopqrstuv";

/*
  dudehex -- convert an integer, v, into n DUDE hex characters.
  The result is placed in ostr. The buffer ends at the byte before
  eop, and false is returned to indicate insufficient buffer space.
*/
static bool dudehex(char * & ostr, const char * eop,
                    unsigned int v, int n)
{
    if ((ostr + n) >= eop)
        return false;
    n--;    // convert to zero origin
    *ostr++ = leadchar[(v >> (n << 2)) & 0x0F];
    while (n > 0)
    {
        n--;
        *ostr++ = hexchar[(v >> (n << 2)) & 0x0F];
    }
    return true;
}

/*
  idn_dudeseg converts istr, a utf-32 domain name segment into DUDE.
  eip points at the character after the input segment.
  ostr points at an output buffer which ends just before eop.
  If there is insufficient buffer space, the function return is false.
  Invalid surrogate sequences will also cause a return of false.
*/
static bool idn_dudeseg(const uchar_t * istr, const uchar_t * eip,
                        char * & ostr, char * eop)
{
    const uchar_t * ip = istr;
    unsigned p = 0;
    while (ip < eip)
    {
        if (*ip == '-')
            *ostr++ = *ip;
        else // if (validnc(*ip))
        {
            unsigned int c = *ip;
            unsigned d = p ^ c;    // d now has the difference (xor)
                                   // between the current and previous char
            int n = 1;             // Count the number of significant nibbles
            while (d >>= 4)
                n++;
            if (!dudehex(ostr, eop, c, n))    // report insufficient buffer space
                return false;
            p = c;
        }
        ip++;
    }
    *ostr = 0;
    return true;
}

/* Decoder: */

enum dude_status dude_decode(
  enum case_sensitivity case_sensitivity,
  char scratch_space[],
  const char input[],
  unsigned int *output_length,
  u_code_point output[],
  unsigned char uppercase_flags[] )
{
  u_code_point prev, q, diff;
  char c;
  unsigned int max_out, in, out, scratch_size;
  enum dude_status status;

  prev = 0x60;
  max_out = *output_length;

  for (c = input[in = 0], out = 0; c != 0; c = input[++in], ++out) {
    /* At the start of each iteration, in and out are the number of */
    /* items already input/output, or equivalently, the indices of */
    /* the next items to be input/output. */

    if (max_out - out < 1) return dude_big_output;

    if (c == 0x2D) output[out] = c; /* hyphen-minus is literal */
    else {
      /* Base-32 sequence. Decode quintets until 0xxxx is found: */

      for (diff = 0; ; c = input[++in]) {
        q = base32_decode(c);
        if (q == base32_invalid) return dude_bad_input;
        diff = (diff << 4) | (q & 0xF);
        if (q >> 4 == 0) break;
      }

      prev = output[out] = prev ^ diff;
    }

    /* Case of last character determines uppercase flag: */
    if (uppercase_flags) uppercase_flags[out] = c >= 65 && c <= 90;
  }

  /* Enforce the uniqueness of the encoding by re-encoding */
  /* the output and comparing the result to the input: */

  scratch_size = ++in;
  status = dude_encode(out, output, uppercase_flags,
                       &scratch_size, scratch_space);
  if (status != dude_success || scratch_size != in ||
      unequal(case_sensitivity, scratch_space, input)
     ) return dude_bad_input;

  *output_length = out;
  return dude_success;
}
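
The decoding loop can be illustrated in isolation with the following
sketch. It is deliberately simplified and is not the decoder above:
the names sketch_digits and sketch_decode are hypothetical, the
alphabet is a string literal (so ASCII is assumed), the uppercase flags
are not reported, and the re-encoding check that enforces uniqueness is
omitted, so non-canonical inputs such as a sequence with a leading zero
quintet are accepted here although dude_decode() would reject them.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical copy of the base-32 alphabet: a-k, m-n, p-z, 2-9. */
static const char sketch_digits[] = "abcdefghijkmnpqrstuvwxyz23456789";

/* Decodes DUDE text (no signature) into code points.  Returns the   */
/* number of code points written, or -1 if the input is malformed or */
/* does not fit in cap entries.                                      */
static long sketch_decode(const char *in, unsigned long out[], size_t cap)
{
  unsigned long prev = 0x60, diff;
  size_t n = 0;

  while (*in != 0) {
    if (n == cap) return -1;

    if (*in == '-') {           /* hyphen-minus is literal; prev is kept */
      out[n++] = 0x2D;
      ++in;
      continue;
    }

    /* Collect 4 bits per character until a 0xxxx quintet is seen. */
    for (diff = 0; ; ++in) {
      const char *p;
      unsigned int q;

      if (*in == 0) return -1;  /* ended in the middle of a sequence */
      p = strchr(sketch_digits, tolower((unsigned char)*in));
      if (p == NULL) return -1; /* not a base-32 character */
      q = (unsigned int)(p - sketch_digits);
      diff = (diff << 4) | (q & 0xF);
      if ((q >> 4) == 0) { ++in; break; }
    }

    prev ^= diff;
    out[n++] = prev;
  }
  return (long)n;
}
```

Decoding "3n" this way yields the single code point U+00FC, and "c-b"
yields U+0062, U+002D, U+0063, since the hyphen does not disturb the
running reference value.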
/*
  idn_UTF32toDUDE converts a UTF-32 domain name into DUDE.
  in, a UTF-32 vector of length inlen is the input domain name.
  outstr is a char output buffer of length outmax.
  On success, the number of output characters is returned.
  On failure, a negative number is returned.
  It is assumed that the input has been nameprepped.
  If this routine is used in a registration context, segment and
  overall length restrictions must be checked by the user.
*/
int idn_UTF32toDUDE(const uchar_t * in, int inlen, char *outstr, int outmax)
{
    const uchar_t *ip = in;
    const uchar_t *eip = in + inlen;
    const uchar_t *ep = ip;
    char *op = outstr;
    char *eop = outstr + outmax - 1;
    while (ip < eip)
    {
        ep = ip;
        while ((ep < eip) && (*ep != '.'))
            ep++;
        const char * tagp = DUDETAG;    // prefix the segment
        while (*tagp)                   // with the tag (dq--)
        {
            if (op >= eop)
            {
                *outstr = '\0';
                return IDN_ERROR;
            }
            *op++ = *tagp++;
        }
        if (idn_isRFC1035(ip, ep - ip))
        {
            if ((ep - ip) >= (eop - op))
            {
                *outstr = '\0';
                return IDN_ERROR;
            }
            while (ip < ep)
                *op++ = *ip++;
        }
        else
        {
            if (!idn_dudeseg(ip, ep, op, eop))
            {
                *outstr = '\0';
                return IDN_ERROR;
            }
        }
        if (op >= eop)    // check for output buffer overflow
        {
            *outstr = '\0';
            return IDN_ERROR;
        }
        if (ep < eip)
            *op++ = *ep;    // copy '.'
        ip = ep + 1;
    }
    *op = '\0';
    return (op - outstr) - 1;
}

/******************************************************************/
/* Wrapper for testing (would normally go in a separate .c file): */

#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* For testing, we'll just set some compile-time limits rather than */
/* use malloc(), and set a compile-time option rather than using a */
/* command-line option. */

enum {
  unicode_max_length = 256,
  ace_max_size = 256,
  test_case_sensitivity = case_insensitive
                          /* suitable for host names */
};

static void usage(char **argv)
{
  fprintf(stderr,
    "%s -e reads code points and writes a DUDE string.\n"
    "%s -d reads a DUDE string and writes code points.\n"
    "Input and output are plain text in the native character set.\n"
    "Code points are in the form u+hex separated by whitespace.\n"
    "A DUDE string is a newline-terminated sequence of LDH characters\n"
    "(without any signature).\n"
    "The case of the u in u+hex is the force-to-uppercase flag.\n"
    , argv[0], argv[0]);
  exit(EXIT_FAILURE);
}

static void fail(const char *msg)
{
  fputs(msg,stderr);
  exit(EXIT_FAILURE);
}

static const char too_big[] =
  "input or output is too large, recompile with larger limits\n";
static const char invalid_input[] = "invalid input\n";
static const char io_error[] = "I/O error\n";

/* The following string is used to convert LDH */
/* characters between ASCII and the native charset: */

static const char ldh_ascii[] =
  "................"
  "................"
  ".............-.."
  "0123456789......"
  ".ABCDEFGHIJKLMNO"
  "PQRSTUVWXYZ....."
  ".abcdefghijklmno"
  "pqrstuvwxyz";

/*
  idn_DUDEsegtoUTF32 converts instr, a DUDE encoded domain name segment
  of length inlen, into UTF32.
  outstr points at an output buffer with room for maxlen code points
  (including the terminating zero).
  On success, the number of code points written is returned.
  On malformed input or insufficient buffer space, IDN_ERROR is returned.
*/
static int idn_DUDEsegtoUTF32(const char * instr, int inlen,
                              uchar_t * outstr, int maxlen)
{
    const char * ip = instr;
    const char * eip = instr + inlen;
    uchar_t * op = outstr;
    uchar_t * eop = op + maxlen - 1;
    unsigned prev = 0;
    while (ip < eip)
    {
        if (*ip == '-')
            *op++ = '-';
        else
        {
            char c0 = tolower(*ip);
            if ((c0 < 'g') || (c0 > 'v'))
                return IDN_ERROR;    // was "return false", which reads as success
            ip++;
            unsigned r = c0 - 'g';
            int n = 1;
            while (ip < eip)
            {
                char cl = tolower(*ip);
                if ((cl >= '0') && (cl <= '9'))
                {
                    r <<= 4;
                    r += cl - '0';
                }
                else if ((cl >= 'a') && (cl <= 'f'))
                {
                    r <<= 4;
                    r += (cl - 'a') + 10;
                }
                else
                    break;
                ip++;
                n++;
            }
            if (r >= 0x0fffff)
            {
                return IDN_ERROR;    // was "return false", which reads as success
            }
            unsigned mask = -1 << (n << 2);
            unsigned cu = (prev & mask) + r;
            prev = cu;
            if (op >= eop)
                return IDN_ERROR;
            *op++ = cu;
        }
    }
    *op = '\0';
    return (op - outstr);
}

int idn_DUDEtoUTF32(const char * in, int inlen, uchar_t * outstr, int outmax)
{
    const char *ip = in;
    const char *eip = in + inlen;
    const char *ep = ip;
    uchar_t *op = outstr;
    uchar_t *eop = outstr + outmax - 1;
    while (ip < eip)
    {
        ep = ip;
        while ((ep < eip) && (*ep != L'.'))
            ep++;
        const char * tip = ip;
        const char * tagp = DUDETAG;
        while (*tagp && (tip < ep) && (tolower(*tagp) == tolower(*tip)))
        {
            tip++;
            tagp++;
        }
        if (*tagp)
        {   // tag doesn't match, copy segment verbatim
            while (ip < ep)
            {
                if (op >= eop)
                    return IDN_ERROR;
                *op++ = *ip++;
            }
        }
        else
        {
            ip = tip;
            int rv = idn_DUDEsegtoUTF32(ip, ep - ip, op, eop - op);
            if (rv < 0)
                return IDN_ERROR;
            op += rv;
        }
        *op++ = *ep;
        if (!*ep)
            break;
        ip = ep + 1;
    }
    if (op >= eop)
        return IDN_ERROR;
    *op = '\0';
    return (op - outstr) - 1;
}

/*
  DUDE test driver
*/
void printres(char *title, int rv, char *buff);
void printres(char *title, int rv, uchar_t *buff);

int main(int argc, char *argv[])
{
    char inbuff[512];
    while (fgets(inbuff, sizeof(inbuff), stdin))
    {
        char cbuff[128];
        uchar_t wbuff[128];
        uchar_t iwbuff[128];
        uchar_t *wsp = wbuff;
        uchar_t wc;
        int in;
        int nr;
        char * inp = inbuff;
        wsp = wbuff;
        while (sscanf(inp, "%x%n", &in, &nr) > 0)
        {
            inp += nr;
            *wsp++ = in;
        }
        fprintf(stdout, "\n");
        int rv;
        rv = idn_UTF32toDUDE(wbuff, wsp - wbuff, cbuff, sizeof(cbuff));
        printres("toDUDE", rv, cbuff);
        if (rv >= 0)
        {
            // outmax counts code points, not bytes
            rv = idn_DUDEtoUTF32(cbuff, rv, iwbuff,
                                 sizeof(iwbuff) / sizeof(iwbuff[0]));
            printres("toUTF32", rv, iwbuff);
        }
    }
    return 0;
}

void printres(char *title, int rv, char *buff)
{
    fprintf(stdout, "%s (%d) : ", title, rv);
    if (rv >= 0)
    {
        unsigned char *dp = (unsigned char *) buff;
        while (*dp)
        {
            fprintf(stdout, "%c", *dp++);
        }
    }
    fprintf(stdout, "\n");
}

int main(int argc, char **argv)
{
  enum dude_status status;
  int r;
  char *p;

  if (argc != 2) usage(argv);
  if (argv[1][0] != '-') usage(argv);
  if (argv[1][2] != 0) usage(argv);

  if (argv[1][1] == 'e') {
    u_code_point input[unicode_max_length];
    unsigned long codept;
    unsigned char uppercase_flags[unicode_max_length];
    char output[ace_max_size], uplus[3];
    unsigned int input_length, output_size, i;

    /* Read the input code points: */

    input_length = 0;

    for (;;) {
      r = scanf("%2s%lx", uplus, &codept);
      if (ferror(stdin)) fail(io_error);
      if (r == EOF || r == 0) break;

      if (r != 2 || uplus[1] != '+' || codept > (u_code_point)-1) {
        fail(invalid_input);
      }

      if (input_length == unicode_max_length) fail(too_big);

      if (uplus[0] == 'u') uppercase_flags[input_length] = 0;
      else if (uplus[0] == 'U') uppercase_flags[input_length] = 1;
      else fail(invalid_input);

      input[input_length++] = codept;
    }

    /* Encode: */

    output_size = ace_max_size;
    status = dude_encode(input_length, input, uppercase_flags,
                         &output_size, output);
    if (status == dude_bad_input) fail(invalid_input);
    if (status == dude_big_output) fail(too_big);
    assert(status == dude_success);

    /* Convert to native charset and output: */

    for (p = output; *p != 0; ++p) {
      i = *p;
      assert(i <= 122 && ldh_ascii[i] != '.');
      *p = ldh_ascii[i];
    }

    r = puts(output);
    if (r == EOF) fail(io_error);
    return EXIT_SUCCESS;
  }

  if (argv[1][1] == 'd') {
    char input[ace_max_size], scratch[ace_max_size], *pp;
    u_code_point output[unicode_max_length];
    unsigned char uppercase_flags[unicode_max_length];
    unsigned int input_length, output_length, i;

    /* Read the DUDE input string and convert to ASCII: */

    fgets(input, ace_max_size, stdin);
    if (ferror(stdin)) fail(io_error);
    if (feof(stdin)) fail(invalid_input);
    input_length = strlen(input);
    if (input[input_length - 1] != '\n') fail(too_big);
    input[--input_length] = 0;

    for (p = input; *p != 0; ++p) {
      pp = strchr(ldh_ascii, *p);
      if (pp == 0) fail(invalid_input);
      *p = pp - ldh_ascii;
    }

    /* Decode: */

    output_length = unicode_max_length;
    status = dude_decode(test_case_sensitivity, scratch, input,
                         &output_length, output, uppercase_flags);
    if (status == dude_bad_input) fail(invalid_input);
    if (status == dude_big_output) fail(too_big);
    assert(status == dude_success);

    /* Output the result: */

    for (i = 0; i < output_length; ++i) {
      r = printf("%s+%04lX\n",
                 uppercase_flags[i] ? "U" : "u",
                 (unsigned long) output[i] );
      if (r < 0) fail(io_error);
    }

    return EXIT_SUCCESS;
  }

  usage(argv);
  return EXIT_SUCCESS; /* not reached, but quiets compiler warning */
}
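
The "u+hex" input form accepted by the test wrapper can be exercised on
its own with the sketch below. It mirrors the scanf() logic of the
wrapper but works on a single token with sscanf(); the function name
sketch_parse_uplus and its out-parameters are hypothetical, and the
range check against u_code_point is left out.

```c
#include <stdio.h>

/* Hypothetical helper: parses one token such as "u+fc" or "U+00FC".  */
/* On success returns 1 and stores the code point and the             */
/* force-to-uppercase flag (1 for a capital U, 0 for a small u).      */
/* Returns 0 if the token is not in u+hex form.                       */
static int sketch_parse_uplus(const char *token,
                              unsigned long *codept, int *force_upper)
{
  char uplus[3];

  /* %2s reads the two-character prefix, %lx the hexadecimal value. */
  if (sscanf(token, "%2s%lx", uplus, codept) != 2) return 0;
  if (uplus[1] != '+') return 0;

  if (uplus[0] == 'u') *force_upper = 0;
  else if (uplus[0] == 'U') *force_upper = 1;
  else return 0;

  return 1;
}
```

With this helper, "U+00FC" parses to the code point 0xFC with the
uppercase flag set, while a token such as "x+41" is rejected.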
void printres(char *title, int rv, uchar_t *buff)
{
    fprintf(stdout, "%s (%d) : ", title, rv);
    if (rv >= 0)
    {
        uchar_t *dp = buff;
        while (*dp)
        {
            fprintf(stdout, " %05x", *dp++);
        }
    }
    fprintf(stdout, "\n");
}

INTERNET-DRAFT expires 2001-Dec-07
 End of changes. 181 change blocks. 
672 lines changed or deleted 641 lines changed or added
