draft-ietf-idn-mace-00.txt   draft-ietf-idn-mace-01.txt 
Internet Draft M. Ishisone Internet Draft M. Ishisone
draft-ietf-idn-mace-00.txt SRA draft-ietf-idn-mace-01.txt SRA
Jun 21, 2001 Y. Yoneya Jun 28, 2001 Y. Yoneya
Expires Dec 21, 2001 JPNIC Expires Dec 28, 2001 JPNIC
MACE: Modal ASCII Compatible Encoding for IDN MACE: Modal ASCII Compatible Encoding for IDN
Status of this Memo Status of this Memo
This document is an Internet-Draft and is subject to all provisions This document is an Internet-Draft and is subject to all provisions
of Section 10 of RFC2026. of Section 10 of RFC2026.
Internet-Drafts are working documents of the Internet Engineering Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that Task Force (IETF), its areas, and its working groups. Note that
skipping to change at page 10, line ? skipping to change at page 10, line ?
MACE is a reversible transformation method from a sequence of Unicode MACE is a reversible transformation method from a sequence of Unicode
[UNICODE] characters to a sequence of ASCII letters, digits and [UNICODE] characters to a sequence of ASCII letters, digits and
hyphens (LDH characters). It is designed to be used as an encoding hyphens (LDH characters). It is designed to be used as an encoding
for internationalized domain names [IDN]. for internationalized domain names [IDN].
Contents Contents
1. Introduction 1. Introduction
2. Terminology 2. Terminology
3. Overview 3. Overview
4. Base32 format 4. Base32 Format
5. Notations 5. Notations
6. Encoding Description 6. Encoding Description
7. Encoding Procedure 7. Encoding Procedure
8. Decoding Description 8. Decoding Description
9. Decoding Procedure 9. Decoding Procedure
10. ACE Identifier 10. ACE Identifier
11. Examples 11. Examples
Expires December 21th, 2001 [Page 1] Expires December 28th, 2001 [Page 1]
12. Security Considerations 12. Security Considerations
13. References 13. References
14. Acknowledgements 14. Acknowledgements
15. Authors' Address 15. Authors' Address
A. Changes from draft-ietf-idn-mace-00
B. Sample Implementation
1. Introduction 1. Introduction
MACE is intended to be used as an ACE in the IDNA architecture MACE is intended to be used as an ACE in the IDNA architecture
[IDNA], and encodes a sequence of Unicode (ISO/IEC 10646) characters [IDNA], and encodes a sequence of Unicode (ISO/IEC 10646) characters
in the range U+0000-U+10FFFF as a sequence of LDH characters. in the range U+0000-U+10FFFF as a sequence of LDH characters.
MACE is designed to have the following features: MACE is designed to have the following features:
Completeness: Every Unicode string has a map to an LDH character Completeness: Every Unicode string has a map to an LDH character
skipping to change at page 10, line ? skipping to change at page 10, line ?
by "U+" followed by four to six hexadecimal digits representing its by "U+" followed by four to six hexadecimal digits representing its
UCS-4 code point. A range of Unicode characters is denoted by the UCS-4 code point. A range of Unicode characters is denoted by the
form "U+xxxx-U+yyyy". form "U+xxxx-U+yyyy".
3. Overview 3. Overview
MACE encodes a sequence of Unicode (ISO/IEC 10646) characters in the MACE encodes a sequence of Unicode (ISO/IEC 10646) characters in the
range U+0000-U+10FFFF as a sequence of LDH characters. range U+0000-U+10FFFF as a sequence of LDH characters.
MACE is a modal encoding. There are two major modes, one of which MACE is a modal encoding. There are two major modes, one of which
has four submodes. Each character is encoded in a specific has four submodes. Each character is encoded in a specific
mode/submode. The mode/submode is chosen according to the code point mode/submode. The mode/submode is chosen according to the code point
of the character and possibly its neighboring characters. The modal of the character and possibly its neighboring characters. The modal
encoding enables compact representation of each character, and the encoding enables compact representation of each character, and the
modes are chosen so that mode change occurs rather infrequently as modes are chosen so that mode change occurs rather infrequently as
long as the source string is written in a single language. long as the source string is written in a single language.
LDH characters are represented literally, for the compactness of the LDH characters are represented literally, for the compactness of the
encoded result. Other Unicode characters are represented as base32 encoded result. Other Unicode characters are represented as base32
format strings. Each Unicode character in the Basic Multilingual format strings. Each Unicode character in the Basic Multilingual
Plane (BMP, U+0000-U+FFFF) except LDH characters is encoded as a Plane (BMP, U+0000-U+FFFF) except LDH characters is encoded as a
3-octet base32 format string, while each non-BMP (U+10000-U+10FFFF) 3-octet base32 format string, while each non-BMP (U+10000-U+10FFFF)
skipping to change at page 10, line ? skipping to change at page 10, line ?
"7" = 7 = 0x07 = 00111 "n" = 23 = 0x17 = 10111 "7" = 7 = 0x07 = 00111 "n" = 23 = 0x17 = 10111
"8" = 8 = 0x08 = 01000 "o" = 24 = 0x18 = 11000 "8" = 8 = 0x08 = 01000 "o" = 24 = 0x18 = 11000
"9" = 9 = 0x09 = 01001 "p" = 25 = 0x19 = 11001 "9" = 9 = 0x09 = 01001 "p" = 25 = 0x19 = 11001
"a" = 10 = 0x0A = 01010 "q" = 26 = 0x1A = 11010 "a" = 10 = 0x0A = 01010 "q" = 26 = 0x1A = 11010
"b" = 11 = 0x0B = 01011 "r" = 27 = 0x1B = 11011 "b" = 11 = 0x0B = 01011 "r" = 27 = 0x1B = 11011
"c" = 12 = 0x0C = 01100 "s" = 28 = 0x1C = 11100 "c" = 12 = 0x0C = 01100 "s" = 28 = 0x1C = 11100
"d" = 13 = 0x0D = 01101 "t" = 29 = 0x1D = 11101 "d" = 13 = 0x0D = 01101 "t" = 29 = 0x1D = 11101
"e" = 14 = 0x0E = 01110 "u" = 30 = 0x1E = 11110 "e" = 14 = 0x0E = 01110 "u" = 30 = 0x1E = 11110
"f" = 15 = 0x0F = 01111 "v" = 31 = 0x1F = 11111 "f" = 15 = 0x0F = 01111 "v" = 31 = 0x1F = 11111
The encoding is big-endian (most-significant bits first). The The encoding is big-endian (most-significant bits first). The
following shows some examples. following shows some examples.
decimal hexadecimal binary base32 string decimal hexadecimal binary base32 string
------------------------------------------------------- -------------------------------------------------------
40 0x28 00001 01000 "18" 40 0x28 00001 01000 "18"
9876 0x2694 01001 10100 10100 "9kk" 9876 0x2694 01001 10100 10100 "9kk"
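The conversion shown in the table above can be sketched in C as follows. This is only an illustrative sketch: the names sketch_base32_encode and sketch_b32_digits are hypothetical and are not part of the draft, and the caller is assumed to supply a buffer of at least LEN + 1 octets.

```c
#include <assert.h>
#include <stddef.h>

/* Digit alphabet from the base32 table: "0"-"9" then "a"-"v". */
static const char sketch_b32_digits[] = "0123456789abcdefghijklmnopqrstuv";

/*
 * Write the LEN-octet, big-endian base32 form of N into OUT and
 * NUL-terminate it.  OUT must have room for LEN + 1 octets
 * (an assumption of this sketch, not a rule from the draft).
 */
static void
sketch_base32_encode(unsigned long n, size_t len, char *out)
{
    size_t i;

    assert(out != NULL);
    out[len] = '\0';
    /* Fill from the last octet backwards: low 5 bits go last. */
    for (i = len; i > 0; i--) {
        out[i - 1] = sketch_b32_digits[n & 0x1F];
        n >>= 5;
    }
}
```

Called with (40, 2) and (9876, 3), the sketch reproduces the two rows of the table, "18" and "9kk".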
5. Notations 5. Notations
In the following description, the following five functions are used. In the following description, the following five functions are used.
base32_encode(N, LEN) base32_encode(N, LEN)
skipping to change at page 10, line ? skipping to change at page 10, line ?
mode/submode. The encoding process of a character is: mode/submode. The encoding process of a character is:
1. Determine the mode/submode to encode the character. 1. Determine the mode/submode to encode the character.
2. If and only if it is necessary to change the current mode, 2. If and only if it is necessary to change the current mode,
output ASCII hyphen-minus to change the mode. output ASCII hyphen-minus to change the mode.
3. If and only if it is necessary to change the current submode, 3. If and only if it is necessary to change the current submode,
output the submode introducer octet (described below) to change output the submode introducer octet (described below) to change
the submode. the submode.
4. Encode the character in the mode/submode. 4. Encode the character in the mode/submode.
ASCII letter and digit characters are encoded in Literal mode, while ASCII letter and digit characters are encoded in Literal mode, while
non-LDH characters are encoded in Non-Literal mode. ASCII hyphen non-LDH characters are encoded in Non-Literal mode. ASCII hyphen
character (U+002D) can be encoded in either mode, and is always character (U+002D) can be encoded in either mode, and is always
encoded as a sequence of two hyphen-minus ("--"). Switching between encoded as a sequence of two hyphen-minus ("--"). Switching between
Literal mode and Non-Literal mode is indicated by an ASCII hyphen not Literal mode and Non-Literal mode is indicated by an ASCII hyphen not
followed by another hyphen. The initial mode is Non-Literal. followed by another hyphen. The initial mode is Non-Literal.
In Literal mode, characters are encoded as they are. For example In Literal mode, characters are encoded as they are. For example
ASCII character "a" is encoded as "a". In Non-Literal mode, ASCII character "a" is encoded as "a". In Non-Literal mode,
characters are encoded as a base32 format string. characters are encoded as a base32 format string.
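The hyphen rule described above (a doubled hyphen stands for U+002D, a lone hyphen toggles the mode) can be illustrated with a small scanner. This is a sketch only: sk_mode_t and sk_scan_hyphens are hypothetical names, and the function merely counts hyphen events in an already-encoded string rather than decoding it.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { SK_LITERAL, SK_NON_LITERAL } sk_mode_t;

/*
 * Walk the MACE-encoded string S of LEN octets.  Count literal
 * hyphens ("--") and mode switches (a hyphen not followed by another
 * hyphen), and return the mode in effect at the end of the string.
 */
static sk_mode_t
sk_scan_hyphens(const char *s, size_t len,
                size_t *literal_hyphens, size_t *mode_switches)
{
    sk_mode_t mode = SK_NON_LITERAL;    /* the initial mode */
    size_t i = 0;

    assert(s != NULL && literal_hyphens != NULL && mode_switches != NULL);
    *literal_hyphens = 0;
    *mode_switches = 0;
    while (i < len) {
        if (s[i] != '-') {
            i++;                        /* ordinary octet */
        } else if (i + 1 < len && s[i + 1] == '-') {
            (*literal_hyphens)++;       /* "--" encodes U+002D */
            i += 2;
        } else {
            (*mode_switches)++;         /* lone "-" toggles the mode */
            mode = (mode == SK_LITERAL) ? SK_NON_LITERAL : SK_LITERAL;
            i++;
        }
    }
    return mode;
}
```

For the encoded string "-a---0o0-b-100x400--c00", the scanner sees four mode switches and two literal hyphens, and finishes in Non-Literal mode.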
Non-Literal mode further comprises four submodes, `BMP-A', `BMP-B', Non-Literal mode further comprises four submodes, `BMP-A', `BMP-B',
skipping to change at page 10, line ? skipping to change at page 10, line ?
integers of the range 0x0000-0x7fff (15bit integer), then converted integers of the range 0x0000-0x7fff (15bit integer), then converted
to base32 format string using the following scheme: to base32 format string using the following scheme:
submode character range encoding submode character range encoding
----------------------------------------------------------------- -----------------------------------------------------------------
BMP-A U+0000-U+1FFF base32_encode(codepoint(C), 3) BMP-A U+0000-U+1FFF base32_encode(codepoint(C), 3)
U+A000-U+FFFF base32_encode(codepoint(C) - 0x8000, 3) U+A000-U+FFFF base32_encode(codepoint(C) - 0x8000, 3)
BMP-B U+2000-U+9FFF base32_encode(codepoint(C) - 0x2000, 3) BMP-B U+2000-U+9FFF base32_encode(codepoint(C) - 0x2000, 3)
Here are some examples: Here are some examples:
character submode integer base32 string character submode integer base32 string
--------------------------------------------- ---------------------------------------------
U+00B0 BMP-A 0xb0 "05g" U+00B0 BMP-A 0xb0 "05g"
U+5678 BMP-B 0x3678 "djo" U+5678 BMP-B 0x3678 "djo"
U+BCDE BMP-A 0x3CDE "f6u" U+BCDE BMP-A 0x3CDE "f6u"
Non-BMP submode is used for encoding Unicode characters outside Basic Non-BMP submode is used for encoding Unicode characters outside Basic
Multilingual Plane (U+10000-U+10FFFF). In this mode a character is Multilingual Plane (U+10000-U+10FFFF). In this mode a character is
skipping to change at page 10, line ? skipping to change at page 10, line ?
1. Let PREV be the last non-LDH character before C, and let NXT be 1. Let PREV be the last non-LDH character before C, and let NXT be
the first non-LDH character after C. In case C is the first the first non-LDH character after C. In case C is the first
non-LDH character of the input string, let PREV be U+0000. non-LDH character of the input string, let PREV be U+0000.
2. If xor(codepoint(PREV), codepoint(C)) > 0x1FF, go to 4. 2. If xor(codepoint(PREV), codepoint(C)) > 0x1FF, go to 4.
3. If at least one of the following conditions holds, choose 3. If at least one of the following conditions holds, choose
`Compress'. Otherwise go to 4. `Compress'. Otherwise go to 4.
a) the current submode is `Compress' a) the current submode is `Compress'
b) C is a non-BMP character (U+10000-U+10FFFF) b) C is a non-BMP character (U+10000-U+10FFFF)
c) xor(codepoint(PREV), codepoint(C)) is less than 16 c) xor(codepoint(PREV), codepoint(C)) is less than 16
d) NXT exists and xor(codepoint(C), codepoint(NXT)) <= 0x1ff d) NXT exists and xor(codepoint(C), codepoint(NXT)) <= 0x1ff
4. If C is in the range U+0000-U+1FFF or U+A000-U+FFFF, choose 4. If C is in the range U+0000-U+1FFF or U+A000-U+FFFF, choose
`BMP-A'. `BMP-A'.
5. If C is in the range U+2000-U+9FFF, choose `BMP-B'. 5. If C is in the range U+2000-U+9FFF, choose `BMP-B'.
6. Otherwise choose `Non-BMP'. 6. Otherwise choose `Non-BMP'.
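Steps 2 through 6 above can be sketched as a single function. This is an illustrative reading of the steps, not the draft's own code: sk_submode_t and sk_choose_submode are hypothetical names, PREV and NXT are passed in as plain code points, and has_next is an assumed flag meaning that NXT exists.

```c
#include <assert.h>

typedef enum {
    SK_SUB_BMP_A, SK_SUB_BMP_B, SK_SUB_NON_BMP, SK_SUB_COMPRESS
} sk_submode_t;

/*
 * Choose the submode for the non-LDH character C, given the current
 * submode, the previous non-LDH character PREV (U+0000 if none) and,
 * when HAS_NEXT is nonzero, the next non-LDH character NEXT.
 */
static sk_submode_t
sk_choose_submode(sk_submode_t current, unsigned long prev,
                  unsigned long c, int has_next, unsigned long next)
{
    unsigned long d = prev ^ c;

    assert(c <= 0x10FFFF);
    /* Steps 2 and 3: Compress is only considered when d <= 0x1FF. */
    if (d <= 0x1FF &&
        (current == SK_SUB_COMPRESS ||              /* 3a */
         c >= 0x10000 ||                            /* 3b */
         d < 16 ||                                  /* 3c */
         (has_next && (c ^ next) <= 0x1FF)))        /* 3d */
        return SK_SUB_COMPRESS;
    /* Steps 4 to 6: fall back to the range-based submodes. */
    if (c <= 0x1FFF || (c >= 0xA000 && c <= 0xFFFF))
        return SK_SUB_BMP_A;
    if (c >= 0x2000 && c <= 0x9FFF)
        return SK_SUB_BMP_B;
    return SK_SUB_NON_BMP;
}
```

With PREV = U+3000 and NXT = U+3100, the character U+3010 satisfies condition d) and is placed in Compress, while U+3000 itself, as the first non-LDH character, falls through to BMP-B.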
Initial state is set as follows. Initial state is set as follows.
mode : Non-Literal mode : Non-Literal
skipping to change at page 10, line ? skipping to change at page 10, line ?
V = V + 0x200 V = V + 0x200
LEN = 2 LEN = 2
else else
LEN = 1 LEN = 1
endif endif
else else
V = codepoint(C) V = codepoint(C)
if (0x0000 <= V <= 0x1FFF) if (0x0000 <= V <= 0x1FFF)
NEW_SUBMODE = `BMP-A' NEW_SUBMODE = `BMP-A'
LEN = 3 LEN = 3
else if (0xA000 <= V <= 0xFFFF) else if (0xA000 <= V <= 0xFFFF)
NEW_SUBMODE = `BMP-A' NEW_SUBMODE = `BMP-A'
V = V - 0x8000 V = V - 0x8000
LEN = 3 LEN = 3
else if (0x2000 <= V <= 0x9FFF) else if (0x2000 <= V <= 0x9FFF)
NEW_SUBMODE = `BMP-B' NEW_SUBMODE = `BMP-B'
V = V - 0x2000 V = V - 0x2000
LEN = 3 LEN = 3
else else
skipping to change at page 10, line ? skipping to change at page 10, line ?
return (TRUE) return (TRUE)
endif endif
endif endif
return (FALSE) return (FALSE)
end end
8. Decoding Description 8. Decoding Description
Like encoding, the MACE decoding process keeps track of the current Like encoding, the MACE decoding process keeps track of the current
mode/submode to decode each character. The initial state for mode/submode to decode each character. The initial state for
decoding is the same as that of encoding. decoding is the same as that of encoding.
mode : Non-Literal mode : Non-Literal
submode : BMP-A submode : BMP-A
PREV : U+0000 PREV : U+0000
Because ASCII domain names are case-insensitive, the decoding process Because ASCII domain names are case-insensitive, the decoding process
MUST treat uppercase letters and lowercase letters equally. MUST treat uppercase letters and lowercase letters equally.
skipping to change at page 10, line ? skipping to change at page 10, line ?
2 character(xor(P, N - 0x200)) 2 character(xor(P, N - 0x200))
[where N is base32_decode(S), P is codepoint(PREV)] [where N is base32_decode(S), P is codepoint(PREV)]
The MACE decoding process can accept invalidly-encoded strings as well. The MACE decoding process can accept invalidly-encoded strings as well.
In order to guarantee the unique mapping, the following two checks In order to guarantee the unique mapping, the following two checks
must be performed. must be performed.
1) The decoded string must be checked if it is a [STD13] conforming 1) The decoded string must be checked if it is a [STD13] conforming
name. If it is, decoding process MUST fail. name. If it is, decoding process MUST fail.
2) The decoded string must be re-encoded and compared to the input 2) The decoded string must be re-encoded and compared to the input
string. If they are not equal (allowing case-difference), string. If they are not equal (allowing case-difference),
decoding process MUST fail. decoding process MUST fail.
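The comparison in check 2) allows case differences in the ASCII letters only. A minimal sketch of such a comparison is given below; sk_ascii_lower and sk_labels_match are hypothetical names, and the re-encoded string is assumed to have been produced separately by the encoder.

```c
#include <assert.h>
#include <stddef.h>

/* Fold an ASCII uppercase letter to lowercase; leave others as is. */
static int
sk_ascii_lower(int c)
{
    return ('A' <= c && c <= 'Z') ? c - 'A' + 'a' : c;
}

/*
 * Return 1 if the input label A (ALEN octets) and the re-encoded
 * label B (BLEN octets) are equal apart from ASCII case, else 0.
 */
static int
sk_labels_match(const char *a, size_t alen, const char *b, size_t blen)
{
    size_t i;

    assert(a != NULL && b != NULL);
    if (alen != blen) return 0;
    for (i = 0; i < alen; i++) {
        if (sk_ascii_lower((unsigned char)a[i]) !=
            sk_ascii_lower((unsigned char)b[i]))
            return 0;
    }
    return 1;
}
```

A decoder following this sketch would fail whenever sk_labels_match returns 0 for the original input and the re-encoded output.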
9. Decoding Procedure 9. Decoding Procedure
procedure decode(input) procedure decode(input)
MODE = `Non-Literal' MODE = `Non-Literal'
SUBMODE = `BMP-A' SUBMODE = `BMP-A'
PREV = U+0000 PREV = U+0000
skipping to change at page 11, line 46 skipping to change at page 11, line 46
For testing purposes, there is a registry of test prefix strings for For testing purposes, there is a registry of test prefix strings for
ACEs on the IETF IDN working group web site [IDN]. ACEs on the IETF IDN working group web site [IDN].
11. Examples 11. Examples
The following examples are meaningless strings, but they are designed The following examples are meaningless strings, but they are designed
to exercise various aspects of the algorithm in order to verify the to exercise various aspects of the algorithm in order to verify the
correctness of the implementation. correctness of the implementation.
(a) U+0200 U+4000 U+002D U+B001 U+40001 U+0061 (a) U+0200 U+4000 U+002D U+B001 U+40001 U+0061
MACE: g0x800--wc01y6001-a MACE: 0g0x800--wc01y6001-a
(b) U+0061 U+002D U+0300 U+0062 U+0400 U+3000 U+002D U+5000 (b) U+0061 U+002D U+0300 U+0062 U+0400 U+3000 U+002D U+5000
MACE: -a---0o0-b-100x400--c00 MACE: -a---0o0-b-100x400--c00
(c) U+1FFF U+2000 U+9FFF U+A000 U+FFFF U+10000 U+10FFFF (c) U+1FFF U+2000 U+9FFF U+A000 U+FFFF U+10000 U+10FFFF
MACE: 7vvx000vvvw800vvvy0000vvvv MACE: 7vvx000vvvw800vvvy0000vvvv
(d) U+0200 U+002F U+0030 U+0039 U+003A U+0200 U+0040 U+0041 \ (d) U+0200 U+002F U+0030 U+0039 U+003A U+0200 U+0040 U+0041 \
U+005A U+005B U+0200 U+0060 U+0061 U+007A U+007B U+005A U+005B U+0200 U+0060 U+0061 U+007A U+007B
MACE: 0g001f-09-01q0g0020-AZ-02r0g0030-az-03r MACE: 0g001f-09-01q0g0020-AZ-02r0g0030-az-03r
skipping to change at page 12, line 21 skipping to change at page 12, line 21
(f) U+0100 U+0102 U+0200 U+002D U+0201 U+002D U+03FE U+0061 U+0234 (f) U+0100 U+0102 U+0200 U+002D U+0201 U+002D U+03FE U+0061 U+0234
MACE: zo02w0g0--z1--vv-a-ua MACE: zo02w0g0--z1--vv-a-ua
(g) U+3000 U+002D U+3010 U+0061 U+3100 U+310F U+31FF (g) U+3000 U+002D U+3010 U+0061 U+3100 U+310F U+31FF
MACE: x400--zgg-a-ogfng MACE: x400--zgg-a-ogfng
(h) U+20000 U+002D U+20100 U+0061 U+20010 U+20012 U+200FF (h) U+20000 U+002D U+20100 U+0061 U+20010 U+20012 U+200FF
MACE: y2000--zo0-a-og2nd MACE: y2000--zo0-a-og2nd
12. Security Considerations The following examples are typical, fairly long (15-25
characters) Japanese names.
(i) 15 CJK Han characters
<zaiadanhoujinhokkaidoshizenhogosuishinkyoukai>
U+8CA1 U+56E3 U+6CD5 U+4EBA U+5317 U+6D77 U+9053 U+81EA \
U+7136 U+4FDD U+8B77 U+63A8 U+9032 U+5354 U+4F1A
MACE: xr51dn3j6lblqconjbns2jofak9mbutqrngt8s1icqkboq
(j) 4 Digits, 2 CJK Han, 1 Hiragana, 6 CJK Han, 6 Katakana characters
2001<nenharu><no><koutsujikobokumetsu><kyanpe-n>
U+0032 U+0030 U+0030 U+0031 U+5E74 U+6625 U+306E U+4EA4 \
U+901A U+4E8B U+6545 U+64B2 U+6EC5 U+30AD U+30E3 U+30F3 \
U+30DA U+30FC U+30F3
MACE: -2001-xfjkhh543ebl4s0qbkbha5h5ijm545dzieggh9h6f
(k) 9 CJK Han, 9 Katakana characters
<saitamarinkaikaiyohakubutsukan><marinmyu-jiamu>
U+57FC U+7389 U+81E8 U+6D77 U+6D77 U+6D0B U+535A U+7269 \
U+9928 U+30DE U+30EA U+30F3 U+30DF U+30E5 U+30FC U+30B8 \
U+30A2 U+30E0
MACE: xdvsks9of8jbnz0jsxcqqkj9u9846uzhkgphchqgpi4gqi2
(l) 6 CJK Han, 19 Katakana characters
<shadanhoujinnippon><nettowa-kuinfome-shonsenta->
U+793E U+56E3 U+6CD5 U+4EBA U+65E5 U+672C U+30CD U+30C3 \
U+30C8 U+30EF U+30FC U+30AF U+30A4 U+30F3 U+30D5 U+30A9 \
U+30E1 U+30FC U+30B7 U+30E7 U+30F3 U+30BB U+30F3 U+30BF \
U+30FC
MACE: xm9udn3j6lblqhf5hpc46dzebh7gjijbinh6jsi8gtibiggki8i8ici3
12. Security Considerations
Users expect each domain name in DNS to be controlled by a single Users expect each domain name in DNS to be controlled by a single
authority. If a Unicode string intended for use as a domain label authority. If a Unicode string intended for use as a domain label
could map to multiple ACE labels, then an internationalized domain could map to multiple ACE labels, then an internationalized domain
name could map to multiple ACE domain names, each controlled by a name could map to multiple ACE domain names, each controlled by a
different authority, some of which could be spoofs that hijack different authority, some of which could be spoofs that hijack
service requests intended for another. Therefore MACE is designed so service requests intended for another. Therefore MACE is designed so
that each Unicode string has a unique encoding. that each Unicode string has a unique encoding.
13. References 13. References
skipping to change at line 645 skipping to change at page 14, line 24
Software Research Associates, Inc. Software Research Associates, Inc.
4-16-10, Chigasaki-Minami, Tsuzuki-ku, Yokohama, 4-16-10, Chigasaki-Minami, Tsuzuki-ku, Yokohama,
Kanagawa 224-0037 Japan Kanagawa 224-0037 Japan
<ishisone@sra.co.jp> <ishisone@sra.co.jp>
Yoshiro Yoneya Yoshiro Yoneya
Japan Network Information Center (JPNIC) Japan Network Information Center (JPNIC)
Fuundo Bldg 1F, 1-2 Kanda-ogawamachi, Fuundo Bldg 1F, 1-2 Kanda-ogawamachi,
Chiyoda-ku Tokyo 101-0052, Japan Chiyoda-ku Tokyo 101-0052, Japan
<yone@nic.ad.jp> <yone@nic.ad.jp>
A. Changes from draft-ietf-idn-mace-00
1) A typo in example a) is fixed.
2) More examples are added.
3) A sample implementation is included as an appendix.
B. Sample Implementation
/*
 * MACE encoder/decoder sample implementation.
 *
 * For brevity, this code assumes it is written in ASCII code (or
 * its superset).
 *
 * Option -e encodes the input Unicode characters (standard U+XXXX
 * notation) and outputs a MACE-encoded string.
 * Option -d decodes a MACE-encoded string and outputs Unicode
 * characters (also in U+XXXX notation).
 */

#include <stdio.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>

/* UCS-4 code point type */
typedef unsigned long mace_ucs_t;

/* Encode/decode status */
typedef enum {
    mace_success,           /* succeeded */
    mace_overflow,          /* buffer overflow */
    mace_invalid_input,     /* input string is invalid */
    mace_nomemory           /* malloc failed */
} mace_result_t;

extern mace_result_t
mace_encode(const mace_ucs_t *from, size_t from_len, char *to,
            size_t to_size, size_t *to_lenp);
extern mace_result_t
mace_decode(const char *from, size_t from_len, mace_ucs_t *to,
            size_t to_size, size_t *to_lenp);

/* Major mode and submode */
typedef enum { Literal, Non_Literal } mace_mode_t;
typedef enum { BMP_A, BMP_B, Non_BMP, Compress } mace_submode_t;

/* Submode introducer octets. */
static int submode_introducer[] = { 'w', 'x', 'y', 'z' };

/* Encode length for a character in each submode. */
/* For Compress submode it is actually either 1 or 2. */
static int submode_encodelen[] = { 3, 3, 4, 1 };

#define LDH_CHARACTER(c) \
    (('a' <= (c) && (c) <= 'z') || ('A' <= (c) && (c) <= 'Z') || \
     ('0' <= (c) && (c) <= '9') || (c) == '-')
#define LOWER_CHAR(c) \
    (('A' <= (c) && (c) <= 'Z') ? (((c) - 'A') + 'a') : (c))

static int
conform_to_std13(mace_ucs_t *s, size_t len)
{
    /*
     * Check if Unicode string S whose length is LEN conforms
     * to the host name part specification by STD13.
     */
    size_t idx;

    /* An empty string is not a name; this also keeps s[len - 1] valid. */
    if (len == 0) return (0);
    for (idx = 0; idx < len; idx++) {
        if (!LDH_CHARACTER(s[idx])) return (0);
    }
    return (s[0] != '-' && s[len - 1] != '-');
}

static void
base32_encode(mace_ucs_t v, size_t len, char *s)
{
    /*
     * Convert non-negative integer V to the corresponding
     * base32 format string of LEN octets and store to S.
     * Caller must check S is big enough to hold LEN octets beforehand.
     */
    static char *b32 = "0123456789abcdefghijklmnopqrstuv";
    int idx;

    for (idx = len - 1; idx >= 0; idx--) {
        s[idx] = b32[v & 0x1F];
        v >>= 5;
    }
}

static int
base32_decode(const char *s, size_t len, mace_ucs_t *vp)
{
    /*
     * Convert a base32 string S of LEN octets long to
     * a non-negative integer, store it to *VP and return 1.
     * If S is not a valid base32 string, return 0.
     */
    mace_ucs_t v = 0;
    int x;

    while (len-- > 0) {
        if ('0' <= *s && *s <= '9') x = *s++ - '0';
        else if ('a' <= *s && *s <= 'v') x = *s++ - 'a' + 10;
        else if ('A' <= *s && *s <= 'V') x = *s++ - 'A' + 10;
        else return 0;
        v = (v << 5) + x;
    }
    *vp = v;
    return 1;
}

static mace_result_t
round_trip_check(const mace_ucs_t *u, size_t ulen,
                 const char *a, size_t alen)
{
    /*
     * Encode Unicode string U whose length is ULEN and compare the
     * result with string A of length ALEN.  If the two are same
     * (allowing case-difference), return mace_success.  Otherwise
     * return mace_invalid_input or mace_nomemory (if malloc failed).
     */
    char *check;
    size_t reallen, idx;

    /* malloc(0) may legitimately return NULL, so ask for at least 1. */
    if ((check = malloc(alen > 0 ? alen : 1)) == NULL)
        return mace_nomemory;
    if (mace_encode(u, ulen, check, alen, &reallen) != mace_success ||
        reallen != alen)
        goto invalid;
    for (idx = 0; idx < alen; idx++) {
        if (LOWER_CHAR(a[idx]) != LOWER_CHAR(check[idx])) goto invalid;
    }
    free(check);
    return mace_success;

invalid:
    free(check);
    return mace_invalid_input;
}

static int
compressible(mace_submode_t submode, mace_ucs_t prev, mace_ucs_t c,
             const mace_ucs_t *rest, size_t rest_len)
{
    /*
     * Determine whether the Unicode character C should be
     * encoded in Compress submode or not.
     */
    size_t idx;     /* unsigned, to match REST_LEN */

    if ((c ^ prev) > 0x1FF) return 0;
    if (submode == Compress || c >= 0x10000 || (c ^ prev) < 16)
        return 1;
    /* Find the next non-LDH character */
    for (idx = 0; idx < rest_len; idx++) {
        if (!LDH_CHARACTER(rest[idx])) break;
    }
    return (idx < rest_len && (c ^ rest[idx]) <= 0x1FF);
}

mace_result_t
mace_encode(const mace_ucs_t *from, size_t from_len,
            char *to, size_t to_size, size_t *to_lenp)
{
    /*
     * Encode a Unicode string FROM whose length is FROM_LEN and store
     * the result to TO, whose allocated length is TO_SIZE.  The
     * length of the result string is stored to *TO_LENP.  Note that
     * TO will not be terminated by NUL character.
     */
    mace_mode_t mode = Non_Literal;
    mace_submode_t submode = BMP_A;
    mace_ucs_t prev = 0;
    const mace_ucs_t *from_ptr = from;
    size_t from_rest = from_len, to_idx = 0, len;
    mace_ucs_t c, v;

#define OUTPUT(c) \
    do { \
        if (to_idx >= to_size) return mace_overflow; \
        to[to_idx++] = (c); \
    } while (0)

    while (from_rest > 0) {
        c = *from_ptr++;
        from_rest--;
        /* Perform range check. */
        if (c > 0x10FFFF) return mace_invalid_input;
        if (c == '-') {
            OUTPUT('-'); OUTPUT('-');
        } else if (LDH_CHARACTER(c)) {
            if (mode != Literal) {
                /* Switch to Literal mode. */
                OUTPUT('-');
                mode = Literal;
            }
            OUTPUT(c);
        } else {
            mace_submode_t new_submode;

            if (mode != Non_Literal) {
                /* Switch to Non-Literal mode. */
                OUTPUT('-');
                mode = Non_Literal;
            }
            if (compressible(submode, prev, c, from_ptr, from_rest)) {
                /* Compress submode */
                new_submode = Compress;
                v = prev ^ c;
                len = 1;
                if (v >= 16) {
                    v += 0x200;
                    len = 2;
                }
            } else {
                /* Choose the right submode based on the code point. */
                if ((0x0000 <= c && c <= 0x1FFF) ||
                    (0xA000 <= c && c <= 0xFFFF)) {
                    new_submode = BMP_A;
                    v = c - (c >= 0xA000 ? 0x8000 : 0);
                } else if (0x2000 <= c && c <= 0x9FFF) {
                    new_submode = BMP_B;
                    v = c - 0x2000;
                } else {
                    new_submode = Non_BMP;
                    v = c - 0x10000;
                }
                len = submode_encodelen[new_submode];
            }
            if (new_submode != submode) {
                /* Shift to the new submode. */
                OUTPUT(submode_introducer[new_submode]);
                submode = new_submode;
            }
            /* Remember the last non-LDH character. */
            prev = c;
            /* Convert to base32 format string. */
            if (to_idx + len > to_size) return mace_overflow;
            base32_encode(v, len, &to[to_idx]);
            to_idx += len;
        }
    }
#undef OUTPUT
    *to_lenp = to_idx;
    return mace_success;
}

mace_result_t
mace_decode(const char *from, size_t from_len,
            mace_ucs_t *to, size_t to_size, size_t *to_lenp)
{
    /*
     * Decode a MACE-encoded string FROM whose length is FROM_LEN
     * and store the result to TO, whose allocated length is TO_SIZE.
     * The length of the result string is stored to *TO_LENP.
     */
    mace_mode_t mode = Non_Literal;
    mace_submode_t submode = BMP_A;
    mace_ucs_t prev = 0, v;
    const char *from_ptr = from;
    size_t from_rest = from_len, to_idx = 0;
    int c;

#define OUTPUT(c) \
    do { \
        if (to_idx >= to_size) return mace_overflow; \
        to[to_idx++] = (c); \
    } while (0)

    while (from_rest > 0) {
        c = *from_ptr++;
        from_rest--;
        if (c == '-') {
            if (from_rest > 0 && from_ptr[0] == '-') {
                /* "--" is a literal hyphen in either mode. */
                OUTPUT('-');
                from_ptr++, from_rest--;
            } else {
                /* A lone hyphen toggles the major mode. */
                mode = (mode == Literal) ? Non_Literal : Literal;
            }
        } else if (mode == Literal) {
            OUTPUT(c);
        } else if (c == 'w' || c == 'W') {
            submode = BMP_A;
        } else if (c == 'x' || c == 'X') {
            submode = BMP_B;
        } else if (c == 'y' || c == 'Y') {
            submode = Non_BMP;
        } else if (c == 'z' || c == 'Z') {
            submode = Compress;
        } else {
            /* size_t, so the comparisons with FROM_REST are unsigned */
            size_t encode_len = (size_t)submode_encodelen[submode];

            from_ptr--, from_rest++;    /* push back C */
            if (from_rest < encode_len ||
                base32_decode(from_ptr, encode_len, &v) == 0)
                return mace_invalid_input;
            if (submode == BMP_A) {
                if (v >= 0x2000) v += 0x8000;
            } else if (submode == BMP_B) {
                v += 0x2000;
            } else if (submode == Non_BMP) {
                v += 0x10000;
            } else {    /* Compress */
                if (v >= 16) {
                    /* First octet >= 16 marks the 2-octet form. */
                    encode_len = 2;
                    if (from_rest < encode_len ||
                        base32_decode(from_ptr, encode_len, &v) == 0)
                        return mace_invalid_input;
                    v -= 0x200;
                }
                v ^= prev;
            }
            OUTPUT(v);
            prev = v;
            from_ptr += encode_len;
            from_rest -= encode_len;
        }
    }
#undef OUTPUT
    *to_lenp = to_idx;
    if (conform_to_std13(to, to_idx)) return mace_invalid_input;
    return (round_trip_check(to, to_idx, from, from_len));
}

/******* Test Driver **************************************************/

static void
error(const char *msg)
{
    fprintf(stderr, "%s\n", msg);
    exit(1);
}

static void
mace_error(const char *s, mace_result_t r)
{
    static char *emsg[] = {
        "no error", "buffer overflowed",
        "input string is invalid", "malloc failed",
    };

    fprintf(stderr, "%s: %s\n", s, emsg[r]);
    exit(1);
}

int
main(int ac, char **av)
{
    char *cmd = *av;
    char line[256];
    mace_ucs_t ucs[64];
    mace_result_t r;
    size_t len, ucslen, i;
    int encode = 1;

    if ('a' != 0x61) error("oops. not ASCII code (EBCDIC?)");
    if (ac > 2) {
    usage:
        fprintf(stderr, "Usage: %s [-e|-d]\n", cmd);
        return 1;
    } else if (ac == 2) {
        if (!strcmp(av[1], "-e")) encode = 1;
        else if (!strcmp(av[1], "-d")) encode = 0;
        else goto usage;
    }
    while (fgets(line, sizeof(line), stdin) != NULL) {
        if (encode) {
            char *p = line, *nxt;
            int idx = 0;

            while (idx < 64) {
                while (isspace((unsigned char)*p)) p++;
                if (*p == '\0') break;
                if (strncmp(p, "U+", 2) != 0)
                    error("invalid input format");
                ucs[idx++] = strtoul(p + 2, &nxt, 16);
                if (nxt == p + 2) error("invalid input format");
                p = nxt;
            }
            if (idx >= 64) error("input too long");
            r = mace_encode(ucs, idx, line, 255, &len);
            if (r != mace_success) mace_error("mace_encode", r);
            printf("%1.*s\n", (int)len, line);
        } else {
            /* Strip the trailing newline only if one is present. */
            len = strlen(line);
            if (len > 0 && line[len - 1] == '\n') len--;
            r = mace_decode(line, len, ucs, 64, &ucslen);
            if (r != mace_success) mace_error("mace_decode", r);
            for (i = 0; i < ucslen; i++) {
                printf("U+%04lX ", ucs[i]);
            }
            printf("\n");
        }
    }
    return 0;
}
 End of changes. 19 change blocks. 
17 lines changed or deleted 48 lines changed or added
