draft-ietf-idn-amc-ace-z-00.txt | draft-ietf-idn-amc-ace-z-01.txt | |||
---|---|---|---|---|

INTERNET-DRAFT Adam M. Costello | INTERNET-DRAFT Adam M. Costello | |||

draft-ietf-idn-amc-ace-z-00.txt 2001-Aug-16 | draft-ietf-idn-amc-ace-z-01.txt 2001-Sep-04 | |||

Expires 2002-Feb-16 | Expires 2002-Mar-04 | |||

AMC-ACE-Z version 0.3.0 | AMC-ACE-Z version 0.3.1 | |||

Status of this Memo | Status of this Memo | |||

This document is an Internet-Draft and is in full conformance with | This document is an Internet-Draft and is in full conformance with | |||

all provisions of Section 10 of RFC2026. | all provisions of Section 10 of RFC2026. | |||

Internet-Drafts are working documents of the Internet Engineering | Internet-Drafts are working documents of the Internet Engineering | |||

Task Force (IETF), its areas, and its working groups. Note | Task Force (IETF), its areas, and its working groups. Note | |||

that other groups may also distribute working documents as | that other groups may also distribute working documents as | |||

Internet-Drafts. | Internet-Drafts. | |||

skipping to change at line 64 | skipping to change at line 64 | |||

3.1 Basic code point segregation | 3.1 Basic code point segregation | |||

3.2 Insertion unsort coding | 3.2 Insertion unsort coding | |||

3.3 Generalized variable-length integers | 3.3 Generalized variable-length integers | |||

3.4 Bias adaptation | 3.4 Bias adaptation | |||

4. Bootstring parameters | 4. Bootstring parameters | |||

5. Parameter values for AMC-ACE-Z | 5. Parameter values for AMC-ACE-Z | |||

6. Bootstring algorithms | 6. Bootstring algorithms | |||

6.1 Bias adaptation function | 6.1 Bias adaptation function | |||

6.2 Decoding procedure | 6.2 Decoding procedure | |||

6.3 Encoding procedure | 6.3 Encoding procedure | |||

6.4 Alternative methods for handling overflow | ||||

7. AMC-ACE-Z example strings | 7. AMC-ACE-Z example strings | |||

8. Security considerations | 8. Security considerations | |||

9. References | 9. References | |||

A. Author contact information | A. Author contact information | |||

B. Mixed-case annotation | B. Mixed-case annotation | |||

C. AMC-ACE-Z sample implementation | C. AMC-ACE-Z sample implementation | |||

1. Introduction | 1. Introduction | |||

The IDNA draft [IDNA] describes an architecture for supporting | The IDNA draft [IDNA] describes an architecture for supporting | |||

skipping to change at line 122 | skipping to change at line 123 | |||

AMC-ACE-Z is a working name that should be changed if it is adopted. | AMC-ACE-Z is a working name that should be changed if it is adopted. | |||

(The Z merely indicates that it is the twenty-sixth ACE devised by | (The Z merely indicates that it is the twenty-sixth ACE devised by | |||

this author. Most were not worth releasing.) | this author. Most were not worth releasing.) | |||

2. Terminology | 2. Terminology | |||

The key words "must", "shall", "required", "should", "recommended", | The key words "must", "shall", "required", "should", "recommended", | |||

and "may" in this document are to be interpreted as described in RFC | and "may" in this document are to be interpreted as described in RFC | |||

2119 [RFC2119]. | 2119 [RFC2119]. | |||

A "code point" is an integral value associated with a character in a | ||||

coded character set. | ||||

As in the Unicode Standard [UNICODE], Unicode code points are | As in the Unicode Standard [UNICODE], Unicode code points are | |||

denoted by "U+" followed by four to six hexadecimal digits, while a | denoted by "U+" followed by four to six hexadecimal digits, while a | |||

range of code points is denoted by two hexadecimal numbers separated | range of code points is denoted by two hexadecimal numbers separated | |||

by "..", with no prefixes. | by "..", with no prefixes. | |||

The operators div and mod perform integer division; (x div y) is the | The operators div and mod perform integer division; (x div y) is the | |||

quotient of x divided by y, discarding the remainder, and (x mod y) | quotient of x divided by y, discarding the remainder, and (x mod y) | |||

is the remainder, so (x div y) * y + (x mod y) == x. Bootstring | is the remainder, so (x div y) * y + (x mod y) == x. Bootstring | |||

uses these operators only with nonnegative operands, so the quotient | uses these operators only with nonnegative operands, so the quotient | |||

and remainder are always nonnegative. | and remainder are always nonnegative. | |||

The ?: operator is a conditional; (x ? y : z) means y if x is true, | ||||

z if x is false. It is just like "if x then y else z" except that y | ||||

and z are expressions rather than statements. | ||||

The "break" statement jumps out of the innermost loop (as in C). | The "break" statement jumps out of the innermost loop (as in C). | |||

An "overflow" is an attempt to compute a value that exceeds the | ||||

maximum value of an integer variable. | ||||

3. Bootstring description | 3. Bootstring description | |||

Bootstring represents an arbitrary sequence of code points (the | Bootstring represents an arbitrary sequence of code points (the | |||

"extended string") as a sequence of basic code points (the | "extended string") as a sequence of basic code points (the | |||

"basic string"). This section describes the representation. | "basic string"). This section describes the representation. | |||

Section 6 "Bootstring algorithms" presents the algorithms as | Section 6 "Bootstring algorithms" presents the algorithms as | |||

pseudocode. | pseudocode. | |||

3.1 Basic code point segregation | 3.1 Basic code point segregation | |||

skipping to change at line 255 | skipping to change at line 258 | |||

integral value. | integral value. | |||

Bootstring uses little-endian ordering so that the deltas can be | Bootstring uses little-endian ordering so that the deltas can be | |||

separated starting with the first. The t(j) values are defined in | separated starting with the first. The t(j) values are defined in | |||

terms of the constants base, tmin, and tmax, and a state variable | terms of the constants base, tmin, and tmax, and a state variable | |||

called bias: | called bias: | |||

t(j) = base * (j + 1) - bias, | t(j) = base * (j + 1) - bias, | |||

clamped to the range tmin through tmax | clamped to the range tmin through tmax | |||

(The clamping means that if the formula yields a value less than | The clamping means that if the formula yields a value less than tmin | |||

tmin or greater than tmax, then t(j) = tmin or tmax, respectively.) | or greater than tmax, then t(j) = tmin or tmax, respectively. (In | |||

These t(j) values cause the representation to favor integers within | the pseudocode in section 6 "Bootstring algorithms", the expression | |||

a particular range determined by the bias. | base * (j + 1) is denoted by k for performance reasons.) These | |||

t(j) values cause the representation to favor integers within a | ||||

particular range determined by the bias. | ||||

3.4 Bias adaptation | 3.4 Bias adaptation | |||

After each delta is encoded or decoded, bias is set for the next | After each delta is encoded or decoded, bias is set for the next | |||

delta as follows: | delta as follows: | |||

1. Delta is scaled in order to avoid overflow in the next step: | 1. Delta is scaled in order to avoid overflow in the next step: | |||

let delta = delta div 2 | let delta = delta div 2 | |||

skipping to change at line 308 | skipping to change at line 313 | |||

expected to be last, tmin for the less significant digits up through | expected to be last, tmin for the less significant digits up through | |||

the one expected to be third-last, and somewhere between tmin and | the one expected to be third-last, and somewhere between tmin and | |||

tmax for the digit expected to be second-last (balancing the hope of | tmax for the digit expected to be second-last (balancing the hope of | |||

the expected-last digit being unnecessary against the danger of it | the expected-last digit being unnecessary against the danger of it | |||

being insufficient). | being insufficient). | |||

4. Bootstring parameters | 4. Bootstring parameters | |||

Given a set of basic code points, one must be designated as | Given a set of basic code points, one must be designated as | |||

the delimiter. The base can be no greater than the number of | the delimiter. The base can be no greater than the number of | |||

distinguishable basic code points remaining. The values 0 through | distinguishable basic code points remaining. The digit-values | |||

base-1 must be associated with non-delimiter basic code points. | in the range 0 through base-1 must be associated with distinct | |||

In some cases multiple code points must represent the same value; | non-delimiter basic code points. In some cases multiple code points | |||

for example, uppercase and lowercase versions of a letter must be | must have the same digit-value; for example, uppercase and lowercase | |||

equivalent if basic strings are case-insensitive. | versions of the same letter must be equivalent if basic strings are | |||

case-insensitive. | ||||

The initial value of n must be no greater than the minimum non-basic | The initial value of n must be no greater than the minimum non-basic | |||

code point that could appear in extended strings. | code point that could appear in extended strings. | |||

The remaining five parameters (tmin, tmax, skew, damp, and the | The remaining five parameters (tmin, tmax, skew, damp, and the | |||

initial value of bias) must satisfy the following constraints: | initial value of bias) must satisfy the following constraints: | |||

0 <= tmin <= tmax <= base-1 | 0 <= tmin <= tmax <= base-1 | |||

skew >= 1 | skew >= 1 | |||

damp >= 2 | damp >= 2 | |||

skipping to change at line 342 | skipping to change at line 348 | |||

5. Parameter values for AMC-ACE-Z | 5. Parameter values for AMC-ACE-Z | |||

AMC-ACE-Z uses the following Bootstring parameter values: | AMC-ACE-Z uses the following Bootstring parameter values: | |||

base = 36 | base = 36 | |||

tmin = 1 | tmin = 1 | |||

tmax = 26 | tmax = 26 | |||

skew = 38 | skew = 38 | |||

damp = 700 | damp = 700 | |||

initial_bias = 72 | initial_bias = 72 | |||

initial_n = U+0080 | initial_n = 0x80 | |||

In AMC-ACE-Z, code points are Unicode code points [UNICODE], that | In AMC-ACE-Z, code points are Unicode code points [UNICODE], that | |||

is, integers in the range 0..10FFFF, but not D800..DFFF, which are | is, integers in the range 0..10FFFF, but not D800..DFFF, which are | |||

reserved for use by UTF-16. The basic code points are the ASCII | reserved for use by UTF-16. The basic code points are the ASCII | |||

code points (0..7F), some of which have values associated with them: | code points (0..7F), of which U+002D (-) is the delimiter, and some | |||

of the others have digit-values as follows: | ||||

U+002D (-) = delimiter | code points digit-values | |||

------------ ---------------------- | ||||

41..5A (A-Z) = 0 to 25, respectively | 41..5A (A-Z) = 0 to 25, respectively | |||

61..7A (a-z) = 0 to 25, respectively | 61..7A (a-z) = 0 to 25, respectively | |||

30..39 (0-9) = 26 to 35, respectively | 30..39 (0-9) = 26 to 35, respectively | |||

Using hyphen-minus as the delimiter implies that the ACE can end | Using hyphen-minus as the delimiter implies that the ACE can end | |||

with a hyphen-minus only if the Unicode string consists entirely | with a hyphen-minus only if the Unicode string consists entirely | |||

of basic code points, but IDNA forbids such strings from being | of basic code points, but IDNA forbids such strings from being | |||

ACE-encoded. Furthermore, the ACE can begin with a hyphen-minus | ACE-encoded. Furthermore, the ACE can begin with a hyphen-minus | |||

only if the Unicode string does, which is forbidden by IDNA. | only if the Unicode string does, which is forbidden by IDNA. | |||

Therefore IDNA using AMC-ACE-Z, regardless of whether the signature | Therefore IDNA using AMC-ACE-Z, regardless of whether the signature | |||

skipping to change at line 390 | skipping to change at line 398 | |||

Such ambiguities are usually resolved by context, but in an ACE | Such ambiguities are usually resolved by context, but in an ACE | |||

there is no context apparent to humans. | there is no context apparent to humans. | |||

6. Bootstring algorithms | 6. Bootstring algorithms | |||

Some parts of the pseudocode can be omitted if the parameters | Some parts of the pseudocode can be omitted if the parameters | |||

satisfy certain conditions (for which AMC-ACE-Z qualifies). These | satisfy certain conditions (for which AMC-ACE-Z qualifies). These | |||

parts are enclosed in {braces}, and notes immediately following the | parts are enclosed in {braces}, and notes immediately following the | |||

pseudocode explain the conditions under which they may be omitted. | pseudocode explain the conditions under which they may be omitted. | |||

Formally, code points are integers, and hence the pseudocode assumes | ||||

that arithmetic operations can be performed directly on code points. | ||||

Some actual programming languages might require explicit conversion | ||||

between code points and integers. | ||||

6.1 Bias adaptation function | 6.1 Bias adaptation function | |||

function adapt(delta,numpoints,firsttime): | function adapt(delta,numpoints,firsttime): | |||

let delta = delta div (firsttime ? damp : 2) | if firsttime then let delta = delta div damp | |||

else let delta = delta div 2 | ||||

let delta = delta + (delta div numpoints) | let delta = delta + (delta div numpoints) | |||

let k = 0 | let k = 0 | |||

while delta > ((base - tmin) * tmax) div 2 | while delta > ((base - tmin) * tmax) div 2 do begin | |||

do let delta = delta div (base - tmin) and let k = k + base | let delta = delta div (base - tmin) | |||

let k = k + base | ||||

end | ||||

return k + (((base - tmin + 1) * delta) div (delta + skew)) | return k + (((base - tmin + 1) * delta) div (delta + skew)) | |||

It does not matter whether the modifications to delta and k | ||||

inside adapt() affect variables of the same name inside the | ||||

encoding/decoding procedures, because after calling adapt() the | ||||

caller does not read those variables before overwriting them. | ||||

6.2 Decoding procedure | 6.2 Decoding procedure | |||

let n = initial_n | let n = initial_n | |||

let i = 0 | let i = 0 | |||

let bias = initial_bias | let bias = initial_bias | |||

let output = an empty string indexed from 0 | let output = an empty string indexed from 0 | |||

consume all code points before the last delimiter (if there is one) | consume all code points before the last delimiter (if there is one) | |||

and copy them to output, fail on any non-basic code point | and copy them to output, fail on any non-basic code point | |||

if more than zero code points were consumed then consume one more | if more than zero code points were consumed then consume one more | |||

(which will be the last delimiter) | ||||

while the input is not exhausted do begin | while the input is not exhausted do begin | |||

let oldi = i | let oldi = i | |||

let w = 1 | let w = 1 | |||

for k = base to infinity in steps of base do begin | for k = base to infinity in steps of base do begin | |||

consume a code point, fail on end-of-input | consume a code point, or fail if there was none to consume | |||

let digit = the code point's value, fail if it has no value | let digit = the code point's digit-value, fail if it has none | |||

let i = i + digit * w, fail on overflow | let i = i + digit * w, fail on overflow | |||

let t = k <= bias ? tmin : k - bias > tmax ? tmax : k - bias | let t = tmin if k <= bias, tmax if k >= bias + tmax, or | |||

k - bias otherwise | ||||

if digit < t then break | if digit < t then break | |||

let w = w * (base - t), fail on overflow | let w = w * (base - t), fail on overflow | |||

end | end | |||

let bias = adapt(i - oldi, length(output) + 1, oldi == 0) | let bias = adapt(i - oldi, length(output) + 1, test oldi is 0?) | |||

let n = n + i div (length(output) + 1), fail on overflow | let n = n + i div (length(output) + 1), fail on overflow | |||

let i = i mod (length(output) + 1) | let i = i mod (length(output) + 1) | |||

{if n is a basic code point then fail} | {if n is a basic code point then fail} | |||

insert n into output at position i | insert n into output at position i | |||

increment i | increment i | |||

end | end | |||

The statement enclosed in braces (checking whether n is a basic | The statement enclosed in braces (checking whether n is a basic | |||

code point) may be omitted if initial_n exceeds all basic code | code point) may be omitted if initial_n exceeds all basic code | |||

points (which is true for AMC-ACE-Z), because n is never less than | points (which is true for AMC-ACE-Z), because n is never less than | |||

initial_n. | initial_n. | |||

Because the decoder state can only advance monotonically, and there | Because the decoder state can only advance monotonically, and there | |||

is only one representation of any delta, there is therefore only | is only one representation of any delta, there is therefore only | |||

one encoded string that can represent a given sequence of integers. | one encoded string that can represent a given sequence of integers. | |||

The only error conditions are invalid code points, unexpected | The only error conditions are invalid code points, unexpected | |||

end-of-input, overflow (attempts to compute values that exceed the | end-of-input, overflow, and basic code points encoded using deltas | |||

maximum value of an integer variable), and basic code points encoded | instead of appearing literally. If the decoder fails on these | |||

using deltas instead of appearing literally. If the decoder fails | errors as shown above, then it cannot produce the same output for | |||

on these errors as shown above, then it cannot produce the same | two distinct inputs, and hence it need not re-encode its output to | |||

output for two distinct inputs, and hence it need not re-encode its | verify that it matches the input. | |||

output to verify that it matches the input. | ||||

The assignment of t, where t is clamped to the range tmin through | The assignment of t, where t is clamped to the range tmin through | |||

tmax, does not handle the case where bias < k < bias + tmin, but | tmax, does not handle the case where bias < k < bias + tmin, but | |||

that is impossible because of the way bias is computed and because | that is impossible because of the way bias is computed and because | |||

of the constraints on the parameters. | of the constraints on the parameters. | |||

If the programming language does not provide overflow detection, | If the programming language does not provide overflow detection, | |||

the following technique can be used. Suppose A, B, and C are | the following technique can be used. Suppose A, B, and C are | |||

representable nonnegative integers and C is nonzero. Then A + B | representable nonnegative integers and C is nonzero. Then A + B | |||

overflows if and only if B > maxint - A, and A + (B * C) overflows | overflows if and only if B > maxint - A, and A + (B * C) overflows | |||

if and only if B > (maxint - A) div C. See appendix C "AMC-ACE-Z | if and only if B > (maxint - A) div C, where maxint is the greatest | |||

sample implementation" for demonstrations of this technique. | integer for which maxint + 1 cannot be represented. Refer to | |||

appendix C "AMC-ACE-Z sample implementation" for demonstrations of | ||||

this technique in the C language. See also section 6.4 "Alternative | ||||

methods for handling overflow". | ||||

6.3 Encoding procedure | 6.3 Encoding procedure | |||

let n = initial_n | let n = initial_n | |||

let delta = 0 | let delta = 0 | |||

let bias = initial_bias | let bias = initial_bias | |||

let h = b = the number of basic code points in the input | let h = b = the number of basic code points in the input | |||

copy them to the output in order, followed by a delimiter if b > 0 | copy them to the output in order, followed by a delimiter if b > 0 | |||

{if the input contains a non-basic code point < n then fail} | {if the input contains a non-basic code point < n then fail} | |||

while h < length(input) do begin | while h < length(input) do begin | |||

let m = the minimum {non-basic} code point >= n in the input | let m = the minimum {non-basic} code point >= n in the input | |||

let delta = delta + (m - n) * (h + 1), fail on overflow | let delta = delta + (m - n) * (h + 1), fail on overflow | |||

let n = m | let n = m | |||

for each code point m in the input (in order) do begin | for each code point c in the input (in order) do begin | |||

if m < n {or m is basic} then increment delta, fail on overflow | if c < n {or c is basic} then increment delta, fail on overflow | |||

if m == n then begin | if c == n then begin | |||

let q = delta | let q = delta | |||

for k = base to infinity in steps of base do begin | for k = base to infinity in steps of base do begin | |||

let t = k <= bias ? tmin : k - bias > tmax ? tmax : k - bias | let t = tmin if k <= bias, tmax if k >= bias + tmax, or | |||

k - bias otherwise | ||||

if q < t then break | if q < t then break | |||

output the code point for digit t + ((q - t) mod (base - t)) | output the code point for digit t + ((q - t) mod (base - t)) | |||

let q = (q - t) div (base - t) | let q = (q - t) div (base - t) | |||

end | end | |||

output the code point for digit q | output the code point for digit q | |||

let bias = adapt(delta, h + 1, h == b) | let bias = adapt(delta, h + 1, test h equals b?) | |||

let delta = 0 | let delta = 0 | |||

increment h | increment h | |||

end | end | |||

end | end | |||

increment delta and n | increment delta and n | |||

end | end | |||

The full statement enclosed in braces (checking whether the input | The full statement enclosed in braces (checking whether the input | |||

contains a non-basic code point less than n) can be omitted if all | contains a non-basic code point less than n) can be omitted if all | |||

code points less than initial_n are basic code points (which is true | code points less than initial_n are basic code points (which is true | |||

for AMC-ACE-Z if code points are unsigned). | for AMC-ACE-Z if code points are unsigned). | |||

The brace-enclosed conditions "non-basic" and "or m is basic" can be | The brace-enclosed conditions "non-basic" and "or m is basic" can be | |||

omitted if initial_n exceeds all basic code points (which is true | omitted if initial_n exceeds all basic code points (which is true | |||

for AMC-ACE-Z), because the code point being tested is never less | for AMC-ACE-Z), because the code point being tested is never less | |||

than initial_n. | than initial_n. | |||

skipping to change at line 497 | skipping to change at line 525 | |||

code points less than initial_n are basic code points (which is true | code points less than initial_n are basic code points (which is true | |||

for AMC-ACE-Z if code points are unsigned). | for AMC-ACE-Z if code points are unsigned). | |||

The brace-enclosed conditions "non-basic" and "or m is basic" can be | The brace-enclosed conditions "non-basic" and "or m is basic" can be | |||

omitted if initial_n exceeds all basic code points (which is true | omitted if initial_n exceeds all basic code points (which is true | |||

for AMC-ACE-Z), because the code point being tested is never less | for AMC-ACE-Z), because the code point being tested is never less | |||

than initial_n. | than initial_n. | |||

The checks for overflow are necessary to avoid producing invalid | The checks for overflow are necessary to avoid producing invalid | |||

output when the input contains very large values or is very long. | output when the input contains very large values or is very long. | |||

Wider integer variables can handle more extreme inputs. For | Wider integer variables can handle more extreme inputs. For IDNA, | |||

AMC-ACE-Z, 26-bit unsigned integers are sufficient, because any | 26-bit unsigned integers are sufficient, because any string that | |||

string that required a 27-bit delta would have to exceed either | required a 27-bit delta would have to exceed either the code point | |||

the code point limit (0..10FFFF) or the label length limit (63 | limit (0..10FFFF) or the label length limit (63 characters). | |||

characters). | ||||

The increment of delta at the bottom of the outer loop cannot | The increment of delta at the bottom of the outer loop cannot | |||

overflow because delta < length(input) before the increment, and | overflow because delta < length(input) before the increment, and | |||

length(input) is already assumed to be representable. The increment | length(input) is already assumed to be representable. The increment | |||

of n could overflow, but only if h == length(input), in which case | of n could overflow, but only if h == length(input), in which case | |||

the procedure is finished anyway. | the procedure is finished anyway. | |||

6.4 Alternative methods for handling overflow | ||||

The encoding and decoding algorithms handle overflow by detecting | ||||

it whenever it happens. Another approach is to enforce limits on | ||||

the inputs that prevent overflow from happening. For example, if | ||||

the encoder were to verify that no input code points exceed M and | ||||

that the input length does not exceed L, then no delta could ever | ||||

exceed (M - initial_n) * (L + 1), and hence no overflow could occur | ||||

if integer variables were capable of representing values that large. | ||||

This prevention approach would impose more restrictions on the input | ||||

than the detection approach does, but might be considered simpler in | ||||

some programming languages. | ||||

In theory, the decoder could use an analogous approach, limiting the | ||||

number of digits in a variable-length integer (that is, limiting the | ||||

number of iterations in the innermost loop). However, the number | ||||

of digits that suffice to represent a given delta can sometimes | ||||

represent much larger deltas (because of the adaptation), and hence | ||||

this approach would probably require integers wider than 32 bits. | ||||

Yet another approach for the decoder is to allow overflow to occur, | ||||

but to check the final output string by re-encoding it and comparing | ||||

to the decoder input. If and only if they do not match (using a | ||||

case-insensitive ASCII comparison) overflow has occurred. This | ||||

delayed-detection approach would not impose any more restrictions on | ||||

the input than the immediate-detection approach does, and might be | ||||

considered simpler in some programming languages. | ||||

7. AMC-ACE-Z example strings | 7. AMC-ACE-Z example strings | |||

In the AMC-ACE-Z encodings below, the IDNA signature prefix is not | In the AMC-ACE-Z encodings below, the IDNA signature prefix is not | |||

shown. AMC-ACE-Z is abbreviated AMC-Z. Backslashes show where line | shown. AMC-ACE-Z is abbreviated AMC-Z. Backslashes show where line | |||

breaks have been inserted in strings too long for one line. | breaks have been inserted in strings too long for one line. | |||

The first several examples are all translations of the sentence "Why | The first several examples are all translations of the sentence "Why | |||

can't they just speak in <language>?" (courtesy of Michael Kaplan's | can't they just speak in <language>?" (courtesy of Michael Kaplan's | |||

"provincial" page [PROVINCIAL]). Word breaks and punctuation have | "provincial" page [PROVINCIAL]). Word breaks and punctuation have | |||

been removed, as is often done in domain names. | been removed, as is often done in domain names. | |||

skipping to change at line 676 | skipping to change at line 731 | |||

Table Specification", 1985-Oct, RFC 952. | Table Specification", 1985-Oct, RFC 952. | |||

[RFC1034] P. Mockapetris, "Domain Names - Concepts and Facilities", | [RFC1034] P. Mockapetris, "Domain Names - Concepts and Facilities", | |||

1987-Nov, RFC 1034. | 1987-Nov, RFC 1034. | |||

[UNICODE] The Unicode Consortium, "The Unicode Standard", | [UNICODE] The Unicode Consortium, "The Unicode Standard", | |||

http://www.unicode.org/unicode/standard/standard.html. | http://www.unicode.org/unicode/standard/standard.html. | |||

A. Author contact information | A. Author contact information | |||

Adam M. Costello <amc@cs.berkeley.edu> | Adam M. Costello | |||

University of California, Berkeley | University of California, Berkeley | |||

http://www.cs.berkeley.edu/~amc/ | http://www.cs.berkeley.edu/~amc/ | |||

B. Mixed-case annotation | B. Mixed-case annotation | |||

In order to use AMC-ACE-Z to represent case-insensitive strings, | In order to use AMC-ACE-Z to represent case-insensitive strings, | |||

higher layers need to case-fold the strings prior to AMC-ACE-Z | higher layers need to case-fold the strings prior to AMC-ACE-Z | |||

encoding. The encoded string can, however, use mixed case as an | encoding. The encoded string can, however, use mixed case as an | |||

annotation telling how to convert the original folded string into a | annotation telling how to convert the original folded string into a | |||

mixed-case string for display purposes. | mixed-case string for display purposes. | |||

skipping to change at line 701 | skipping to change at line 756 | |||

the last of which provides the annotation. If it is uppercase, | the last of which provides the annotation. If it is uppercase, | |||

it is a suggestion to map the non-basic code point to uppercase | it is a suggestion to map the non-basic code point to uppercase | |||

(if possible); if it is lowercase, it is a suggestion to map the | (if possible); if it is lowercase, it is a suggestion to map the | |||

non-basic code point to lowercase (if possible). | non-basic code point to lowercase (if possible). | |||

AMC-ACE-Z encoders and decoders are not required to support these | AMC-ACE-Z encoders and decoders are not required to support these | |||

annotations, and higher layers need not use them. | annotations, and higher layers need not use them. | |||

C. AMC-ACE-Z sample implementation | C. AMC-ACE-Z sample implementation | |||

/******************************************/ | /********************************************/ | |||

/* amc-ace-z.c 0.3.0 (2001-Aug-07-Tue) */ | /* amc-ace-z.c 0.3.1 (2001-Sep-01-Sat) */ | |||

/* Adam M. Costello <amc@cs.berkeley.edu> */ | /* http://www.cs.berkeley.edu/~amc/charset/ */ | |||

/******************************************/ | /* Adam M. Costello */ | |||

/* http://www.cs.berkeley.edu/~amc/ */ | ||||

/********************************************/ | ||||

/* This is ANSI C code (C89) implementing AMC-ACE-Z version 0.3.x. */ | /* This is ANSI C code (C89) implementing AMC-ACE-Z version 0.3.x. */ | |||

/************************************************************/ | /************************************************************/ | |||

/* Public interface (would normally go in its own .h file): */ | /* Public interface (would normally go in its own .h file): */ | |||

#include <limits.h> | #include <limits.h> | |||

enum amc_ace_status { | enum amc_ace_status { | |||

amc_ace_success, | amc_ace_success, | |||

skipping to change at line 829 | skipping to change at line 887 | |||

/* 0..25 map to ASCII a..z or A..Z */ | /* 0..25 map to ASCII a..z or A..Z */ | |||

/* 26..35 map to ASCII 0..9 */ | /* 26..35 map to ASCII 0..9 */ | |||

} | } | |||

/* flagged(bcp) tests whether a basic code point is flagged */ | /* flagged(bcp) tests whether a basic code point is flagged */ | |||

/* (uppercase). The behavior is undefined if bcp is not a */ | /* (uppercase). The behavior is undefined if bcp is not a */ | |||

/* basic code point. */ | /* basic code point. */ | |||

#define flagged(bcp) ((amc_ace_z_uint)(bcp) - 65 < 26) | #define flagged(bcp) ((amc_ace_z_uint)(bcp) - 65 < 26) | |||

/*** Useful constants ***/ | /*** Platform-specific constants ***/ | |||

/* maxint is the maximum value of an amc_ace_z_uint variable: */ | /* maxint is the maximum value of an amc_ace_z_uint variable: */ | |||

static const amc_ace_z_uint maxint = -1; | static const amc_ace_z_uint maxint = -1; | |||

/* Because maxint is unsigned, -1 becomes the maximum value. */ | ||||

/* lobase and cutoff are used in the calculation of bias: */ | /*** Bias adaptation function ***/ | |||

enum { lobase = base - tmin, cutoff = lobase * tmax / 2 }; | ||||

static amc_ace_z_uint adapt( | ||||

amc_ace_z_uint delta, amc_ace_z_uint numpoints, int firsttime ) | ||||

{ | ||||

amc_ace_z_uint k; | ||||

delta = firsttime ? delta / damp : delta >> 1; | ||||

/* delta >> 1 is a faster way of doing delta / 2 */ | ||||

delta += delta / numpoints; | ||||

for (k = 0; delta > ((base - tmin) * tmax) / 2; k += base) { | ||||

delta /= base - tmin; | ||||

} | ||||

return k + (base - tmin + 1) * delta / (delta + skew); | ||||

} | ||||

/*** Main encode function ***/ | /*** Main encode function ***/ | |||

enum amc_ace_status amc_ace_z_encode( | enum amc_ace_status amc_ace_z_encode( | |||

amc_ace_z_uint input_length, | amc_ace_z_uint input_length, | |||

const amc_ace_z_uint input[], | const amc_ace_z_uint input[], | |||

const unsigned char uppercase_flags[], | const unsigned char uppercase_flags[], | |||

amc_ace_z_uint *output_length, | amc_ace_z_uint *output_length, | |||

char output[] ) | char output[] ) | |||

{ | { | |||

skipping to change at line 893 | skipping to change at line 967 | |||

} | } | |||

/* Increase delta enough to advance the decoder's */ | /* Increase delta enough to advance the decoder's */ | |||

/* <n,i> state to <m,0>, but guard against overflow: */ | /* <n,i> state to <m,0>, but guard against overflow: */ | |||

if (m - n > (maxint - delta) / (h + 1)) return amc_ace_overflow; | if (m - n > (maxint - delta) / (h + 1)) return amc_ace_overflow; | |||

delta += (m - n) * (h + 1); | delta += (m - n) * (h + 1); | |||

n = m; | n = m; | |||

for (j = 0; j < input_length; ++j) { | for (j = 0; j < input_length; ++j) { | |||

#if 0 | /* AMC-ACE-Z does not need to check whether input[j] is basic: */ | |||

if (input[j] < n || basic(input[j])) { | if (input[j] < n /* || basic(input[j]) */ ) { | |||

if (++delta == 0) return amc_ace_overflow; | if (++delta == 0) return amc_ace_overflow; | |||

} | } | |||

#endif | ||||

/* AMC-ACE-Z can use this simplified version instead: */ | ||||

if (input[j] < n && ++delta == 0) return amc_ace_overflow; | ||||

if (input[j] == n) { | if (input[j] == n) { | |||

/* Represent delta as a generalized variable-length integer: */ | /* Represent delta as a generalized variable-length integer: */ | |||

for (q = delta, k = base; ; k += base) { | for (q = delta, k = base; ; k += base) { | |||

if (out >= max_out) return amc_ace_big_output; | if (out >= max_out) return amc_ace_big_output; | |||

t = k <= bias ? tmin : k - bias >= tmax ? tmax : k - bias; | t = k <= bias ? tmin : k - bias >= tmax ? tmax : k - bias; | |||

if (q < t) break; | if (q < t) break; | |||

output[out++] = encode_digit(t + (q - t) % (base - t), 0); | output[out++] = encode_digit(t + (q - t) % (base - t), 0); | |||

q = (q - t) / (base - t); | q = (q - t) / (base - t); | |||

} | } | |||

output[out++] = | output[out++] = | |||

encode_digit(q, uppercase_flags && uppercase_flags[j]); | encode_digit(q, uppercase_flags && uppercase_flags[j]); | |||

skipping to change at line 914 | skipping to change at line 984 | |||

for (q = delta, k = base; ; k += base) { | for (q = delta, k = base; ; k += base) { | |||

if (out >= max_out) return amc_ace_big_output; | if (out >= max_out) return amc_ace_big_output; | |||

t = k <= bias ? tmin : k - bias >= tmax ? tmax : k - bias; | t = k <= bias ? tmin : k - bias >= tmax ? tmax : k - bias; | |||

if (q < t) break; | if (q < t) break; | |||

output[out++] = encode_digit(t + (q - t) % (base - t), 0); | output[out++] = encode_digit(t + (q - t) % (base - t), 0); | |||

q = (q - t) / (base - t); | q = (q - t) / (base - t); | |||

} | } | |||

output[out++] = | output[out++] = | |||

encode_digit(q, uppercase_flags && uppercase_flags[j]); | encode_digit(q, uppercase_flags && uppercase_flags[j]); | |||

bias = adapt(delta, h + 1, h == b); | ||||

/* Adapt the bias: */ | ||||

delta = h == b ? delta / damp : delta >> 1; | ||||

delta += delta / (h + 1); | ||||

for (bias = 0; delta > cutoff; bias += base) delta /= lobase; | ||||

bias += (lobase + 1) * delta / (delta + skew); | ||||

delta = 0; | delta = 0; | |||

++h; | ++h; | |||

} | } | |||

} | } | |||

++delta, ++n; | ++delta, ++n; | |||

} | } | |||

*output_length = out; | *output_length = out; | |||

return amc_ace_success; | return amc_ace_success; | |||

} | } | |||

/*** Main decode function ***/ | /*** Main decode function ***/ | |||

enum amc_ace_status amc_ace_z_decode( | enum amc_ace_status amc_ace_z_decode( | |||

skipping to change at line 941 | skipping to change at line 1006 | |||

/*** Main decode function ***/ | /*** Main decode function ***/ | |||

enum amc_ace_status amc_ace_z_decode( | enum amc_ace_status amc_ace_z_decode( | |||

amc_ace_z_uint input_length, | amc_ace_z_uint input_length, | |||

const char input[], | const char input[], | |||

amc_ace_z_uint *output_length, | amc_ace_z_uint *output_length, | |||

amc_ace_z_uint output[], | amc_ace_z_uint output[], | |||

unsigned char uppercase_flags[] ) | unsigned char uppercase_flags[] ) | |||

{ | { | |||

amc_ace_z_uint n, out, i, max_out, bias, b, j, | amc_ace_z_uint n, out, i, max_out, bias, | |||

in, oldi, w, k, delta, digit, t; | b, j, in, oldi, w, k, digit, t; | |||

/* Initialize the state: */ | /* Initialize the state: */ | |||

n = initial_n; | n = initial_n; | |||

out = i = 0; | out = i = 0; | |||

max_out = *output_length; | max_out = *output_length; | |||

bias = initial_bias; | bias = initial_bias; | |||

/* Handle the basic code points: Let b be the number of input code */ | /* Handle the basic code points: Let b be the number of input code */ | |||

/* points before the last delimiter, or 0 if there is none, then */ | /* points before the last delimiter, or 0 if there is none, then */ | |||

skipping to change at line 988 | skipping to change at line 1053 | |||

digit = decode_digit(input[in++]); | digit = decode_digit(input[in++]); | |||

if (digit >= base) return amc_ace_bad_input; | if (digit >= base) return amc_ace_bad_input; | |||

if (digit > (maxint - i) / w) return amc_ace_overflow; | if (digit > (maxint - i) / w) return amc_ace_overflow; | |||

i += digit * w; | i += digit * w; | |||

t = k <= bias ? tmin : k - bias >= tmax ? tmax : k - bias; | t = k <= bias ? tmin : k - bias >= tmax ? tmax : k - bias; | |||

if (digit < t) break; | if (digit < t) break; | |||

if (w > maxint / (base - t)) return amc_ace_overflow; | if (w > maxint / (base - t)) return amc_ace_overflow; | |||

w *= (base - t); | w *= (base - t); | |||

} | } | |||

/* Adapt the bias: */ | bias = adapt(i - oldi, out + 1, oldi == 0); | |||

delta = oldi == 0 ? i / damp : (i - oldi) >> 1; | ||||

delta += delta / (out + 1); | ||||

for (bias = 0; delta > cutoff; bias += base) delta /= lobase; | ||||

bias += (lobase + 1) * delta / (delta + skew); | ||||

/* i was supposed to wrap around from out+1 to 0, */ | /* i was supposed to wrap around from out+1 to 0, */ | |||

/* incrementing n each time, so we'll fix that now: */ | /* incrementing n each time, so we'll fix that now: */ | |||

if (i / (out + 1) > maxint - n) return amc_ace_overflow; | if (i / (out + 1) > maxint - n) return amc_ace_overflow; | |||

n += i / (out + 1); | n += i / (out + 1); | |||

i %= (out + 1); | i %= (out + 1); | |||

/* Insert n at position i of the output: */ | /* Insert n at position i of the output: */ | |||

skipping to change at line 1190 | skipping to change at line 1250 | |||

if (r < 0) fail(io_error); | if (r < 0) fail(io_error); | |||

} | } | |||

return EXIT_SUCCESS; | return EXIT_SUCCESS; | |||

} | } | |||

usage(argv); | usage(argv); | |||

return EXIT_SUCCESS; /* not reached, but quiets compiler warning */ | return EXIT_SUCCESS; /* not reached, but quiets compiler warning */ | |||

} | } | |||

INTERNET-DRAFT expires 2002-Feb-16 | INTERNET-DRAFT expires 2002-Mar-04 | |||

End of changes. 41 change blocks. | ||||

72 lines changed or deleted | | 132 lines changed or added | ||

This html diff was produced by rfcdiff 1.33. The latest version is available from http://tools.ietf.org/tools/rfcdiff/ |