draft-ietf-idn-lace-00.txt   draft-ietf-idn-lace-01.txt 
Internet Draft Mark Davis Internet Draft Mark Davis
draft-ietf-idn-lace-00.txt IBM draft-ietf-idn-lace-01.txt IBM
November 6, 2000 Paul Hoffman January 5, 2001 Paul Hoffman
Expires May 6, 2001 IMC & VPNC Expires July 5, 2001 IMC & VPNC
LACE: Length-based ASCII Compatible Encoding for IDN LACE: Length-based ASCII Compatible Encoding for IDN
Status of this memo Status of this memo
This document is an Internet-Draft and is in full conformance with all This document is an Internet-Draft and is in full conformance with all
provisions of Section 10 of RFC2026. provisions of Section 10 of RFC2026.
Internet-Drafts are working documents of the Internet Engineering Task Internet-Drafts are working documents of the Internet Engineering Task
Force (IETF), its areas, and its working groups. Note that other Force (IETF), its areas, and its working groups. Note that other
skipping to change at line 52 skipping to change at line 52
There is a strong world-wide desire to use characters other than plain There is a strong world-wide desire to use characters other than plain
ASCII in host names. Host names have become the equivalent of business ASCII in host names. Host names have become the equivalent of business
or product names for many services on the Internet, so there is a need or product names for many services on the Internet, so there is a need
to make them usable by people whose native scripts are not representable to make them usable by people whose native scripts are not representable
by ASCII. The requirements for internationalizing host names are by ASCII. The requirements for internationalizing host names are
described in the IDN WG's requirements document, [IDNReq]. described in the IDN WG's requirements document, [IDNReq].
The IDN WG's comparison document [IDNComp] describes three potential The IDN WG's comparison document [IDNComp] describes three potential
main architectures for IDN: arch-1 (just send binary), arch-2 (send main architectures for IDN: arch-1 (just send binary), arch-2 (send
binary or ACE), and arch-3 (just send ACE). LACE is an ACE, called binary or ACE), and arch-3 (just send ACE). LACE is an ACE, called
Row-based ACE or LACE, that can be used with protocols that match arch-2 Length-based ACE or LACE, that can be used with protocols that match arch-2
or arch-3. LACE specifies an ACE format as specified in ace-1 in or arch-3. LACE specifies an ACE format as specified in ace-1 in
[IDNComp]. Further, it specifies an identifying mechanism for ace-2 in [IDNComp]. Further, it specifies an identifying mechanism for ace-2 in
[IDNComp], namely ace-2.1.1 (add hopefully-unique legal tag to the [IDNComp], namely ace-2.1.1 (add hopefully-unique legal tag to the
beginning of the name part). beginning of the name part).
In formal terms, LACE describes a character encoding scheme of the In formal terms, LACE describes a character encoding scheme of the
ISO/IEC 10646 [ISO10646] coded character set (whose assignment of ISO/IEC 10646 [ISO10646] coded character set (whose assignment of
characters is synchronized with Unicode [Unicode3]) and the rules for characters is synchronized with Unicode [Unicode3]) and the rules for
using that scheme in the DNS. As such, it could also be called a using that scheme in the DNS. As such, it could also be called a
"charset" as defined in [IDNReq]. "charset" as defined in [IDNReq]. It can also be viewed as a specialized
UTF (transformation format), designed to work within the restrictions of
the DNS.
The LACE protocol has the following features: The LACE protocol has the following features:
- There is exactly one way to convert internationalized host parts to - There is exactly one way to convert internationalized host parts to
and from LACE parts. Host name part uniqueness is preserved. and from LACE parts. Host name part uniqueness is preserved.
- Host parts that have no international characters are not changed. - Host parts that have no international characters are not changed.
- Names using LACE can include more internationalized characters than - Names using LACE can include more internationalized characters than
with other ACE protocols that have been suggested to date. LACE-encoded with other ACE protocols that have been suggested to date. LACE-encoded
skipping to change at line 96 skipping to change at line 98
The key words "MUST", "SHALL", "REQUIRED", "SHOULD", "RECOMMENDED", and The key words "MUST", "SHALL", "REQUIRED", "SHOULD", "RECOMMENDED", and
"MAY" in this document are to be interpreted as described in RFC 2119 "MAY" in this document are to be interpreted as described in RFC 2119
[RFC2119]. [RFC2119].
Hexadecimal values are shown preceded with an "0x". For example, Hexadecimal values are shown preceded with an "0x". For example,
"0xa1b5" indicates two octets, 0xa1 followed by 0xb5. Binary values are "0xa1b5" indicates two octets, 0xa1 followed by 0xb5. Binary values are
shown preceded with an "0b". For example, a nine-bit value might be shown preceded with an "0b". For example, a nine-bit value might be
shown as "0b101101111". shown as "0b101101111".
Examples in this document use the notation from the Unicode Standard Examples in this document use the notation for code points and names
[Unicode3] as well as the ISO 10646 names. For example, the letter "a" from the Unicode Standard [Unicode3] and ISO 10646. For example, the
may be represented as either "U+0061" or "LATIN SMALL LETTER A". letter "a" may be represented as either "U+0061" or "LATIN SMALL LETTER
A".
LACE converts strings with internationalized characters into LACE converts strings with internationalized characters into
strings of US-ASCII that are acceptable as host name parts in current strings of US-ASCII that are acceptable as host name parts in current
DNS host naming usage. The former are called "pre-converted" and the DNS host naming usage. The former are called "pre-converted" and the
latter are called "post-converted". latter are called "post-converted".
1.2 IDN summary 1.2 IDN summary
Using the terminology in [IDNComp], LACE specifies an ACE format as Using the terminology in [IDNComp], LACE specifies an ACE format as
specified in ace-1. Further, it specifies an identifying mechanism for specified in ace-1. Further, it specifies an identifying mechanism for
ace-2, namely ace-2.1.1 (add hopefully-unique legal tag to the beginning ace-2, namely ace-2.1.1 (add hopefully-unique legal tag to the beginning
of the name part). of the name part).
LACE has the following length characteristics. In this list, "row" means LACE has the following length characteristics.
a row from ISO 10646.
- LACE-encoded names are variable length, depending on the number of - LACE-encoded names are variable length, depending on the number of
transitions between rows that appear in the name part. transitions between rows that appear in the name part.
- Name parts that cannot be compressed using run-length encoding can - Name parts that cannot be compressed using run-length encoding can
have up to 17 characters. have up to 17 characters.
- Names that can be compressed can have up to 35 characters. - Names that can be compressed can have up to 35 characters.
- A name that has just a few row transitions typically can have over 30 - A name that has just a few row transitions typically can have over 30
skipping to change at line 138 skipping to change at line 140
According to [STD13], host parts must be case-insensitive, start and According to [STD13], host parts must be case-insensitive, start and
end with a letter or digit, and contain only letters, digits, and the end with a letter or digit, and contain only letters, digits, and the
hyphen character ("-"). This, of course, excludes any internationalized hyphen character ("-"). This, of course, excludes any internationalized
characters, as well as many other characters in the ASCII character characters, as well as many other characters in the ASCII character
repertoire. Further, domain name parts must be 63 octets or shorter in repertoire. Further, domain name parts must be 63 octets or shorter in
length. length.
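The host part rules just summarized can be sketched as a small check. This is only an illustration: the function name isStd13HostPart is hypothetical, and the rules encoded are the ones stated in the paragraph above, not a full reading of [STD13].

```javascript
// Hypothetical helper (not part of the draft): tests the host part
// rules as summarized above: letters, digits, and hyphen only; first
// and last character a letter or digit; at most 63 octets.
function isStd13HostPart(part) {
  if (part.length < 1 || part.length > 63) {
    return false;
  }
  // The pattern matches ASCII only, so for any matching string one
  // character equals one octet and the length test above is in octets.
  return /^[A-Za-z0-9]([A-Za-z0-9-]*[A-Za-z0-9])?$/.test(part);
}
```

Any string containing an internationalized character fails this check, which is why such name parts need an ACE.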
2.1 Name tagging 2.1 Name tagging
All post-converted name parts that contain internationalized characters All post-converted name parts that contain internationalized characters
begin with the string "bq--". (Of course, because host name parts are begin with the string "lq--". (Of course, because host name parts are
case-insensitive, this might also be represented as "Bq--" or "bQ--" or case-insensitive, this might also be represented as "Lq--" or "lQ--" or
"BQ--".) The string "bq--" was chosen because it is extremely unlikely "LQ--".) The string "lq--" was chosen because it is extremely unlikely
to exist in host parts before this specification was produced. As a to exist in host parts before this specification was produced. As a
historical note, in late August 2000, none of the second-level host name historical note, in late October 2000, none of the second-level host
parts in any of the .com, .edu, .net, and .org top-level domains began name parts in any of the .com, .edu, .net, and .org top-level domains
with "bq--"; there are many tens of thousands of other strings of three began with "lq--"; there are many tens of thousands of other strings of
characters followed by a hyphen that have this property and could be three characters followed by a hyphen that have this property and could
used instead. The string "bq--" will change to other strings with the be used instead. The string "lq--" will change to other strings with the
same properties in future versions of this draft. same properties in future versions of this draft.
Note that a zone administrator might still choose to use "bq--" at the Note that a zone administrator might still choose to use "lq--" at the
beginning of a host name part even if that part does not contain beginning of a host name part even if that part does not contain
internationalized characters. Zone administrators SHOULD NOT create host internationalized characters. Zone administrators SHOULD NOT create host
part names that begin with "bq--" unless those names are post-converted part names that begin with "lq--" unless those names are post-converted
names. Creating host part names that begin with "bq--" but that are not names. Creating host part names that begin with "lq--" but that are not
post-converted names may cause two distinct problems. Some display post-converted names may cause two distinct problems. Some display
systems, after converting the post-converted name part back to an systems, after converting the post-converted name part back to an
internationalized name part, might display the name parts in a internationalized name part, might display the name parts in a
possibly-confusing fashion to users. More seriously, some resolvers, possibly-confusing fashion to users. More seriously, some resolvers,
after converting the post-converted name part back to an after converting the post-converted name part back to an
internationalized name part, might reject the host name if it contains internationalized name part, might reject the host name if it contains
illegal characters. illegal characters.
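Because host name parts are case-insensitive, a tag test has to ignore case. The following is a minimal sketch, assuming the "lq--" tag of the -01 text (the -00 text used "bq--"); the names LACE_TAG and hasLaceTag are hypothetical.

```javascript
// Tag from the -01 text; the draft says it will change in later versions.
const LACE_TAG = "lq--";

// Hypothetical helper: true if the name part starts with the tag,
// compared without regard to case ("Lq--", "lQ--", "LQ--" all match).
function hasLaceTag(part) {
  return part.slice(0, LACE_TAG.length).toLowerCase() === LACE_TAG;
}
```

A part that passes this test is only a candidate; it still has to decode successfully before it can be treated as a post-converted name.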
2.2 Converting an internationalized name to an ACE name part 2.2 Converting an internationalized name to an ACE name part
skipping to change at line 178 skipping to change at line 180
If a name part consists exclusively of characters that conform to the If a name part consists exclusively of characters that conform to the
host name requirements in [STD13], the name MUST NOT be converted to host name requirements in [STD13], the name MUST NOT be converted to
LACE. That is, a name part that can be represented without LACE MUST NOT LACE. That is, a name part that can be represented without LACE MUST NOT
be encoded using LACE. This absolute requirement prevents there from be encoded using LACE. This absolute requirement prevents there from
being two different encodings for a single DNS host name. being two different encodings for a single DNS host name.
If any checking for prohibited name parts (such as ones that are If any checking for prohibited name parts (such as ones that are
prohibited characters, case-folding, or canonicalization) is to be done, prohibited characters, case-folding, or canonicalization) is to be done,
it MUST be done before doing the conversion to an ACE name part. it MUST be done before doing the conversion to an ACE name part.
Characters outside the first plane of characters (those with codepoints
above U+FFFF) MUST be represented using surrogates, as described in
RFC 2781 [RFC2781].
The input name string consists of characters from the ISO 10646 The input name string consists of characters from the ISO 10646
character set in big-endian UTF-16 encoding. This is the pre-converted character set in big-endian UTF-16 encoding. This is the pre-converted
string. string.
Characters outside the first plane of characters 2.2.1 Check the input string for disallowed names
(those with codepoints above U+FFFF) MUST be represented using surrogates, as
described in the UTF-16 description in ISO 10646.
2.2.1 Compress the pre-converted string If the input string consists only of characters that conform to the host
name requirements in [STD13], the conversion MUST stop with an error.
2.2.2 Compress the pre-converted string
The entire pre-converted string MUST be compressed using the compression The entire pre-converted string MUST be compressed using the compression
algorithm specified in section 2.4. The result of this step is the algorithm specified in section 2.4. The result of this step is the
compressed string. compressed string.
2.2.2 Check the length of the compressed string 2.2.3 Check the length of the compressed string
The compressed string MUST be 36 octets or shorter. If the compressed The compressed string MUST be 36 octets or shorter. If the compressed
string is 37 octets or longer, the conversion MUST stop with an error. string is 37 octets or longer, the conversion MUST stop with an error.
2.2.3 Encode the compressed string with Base32 2.2.4 Encode the compressed string with Base32
The compressed string MUST be converted using the Base32 encoding The compressed string MUST be converted using the Base32 encoding
described in section 2.5. The result of this step is the encoded string. described in section 2.5. The result of this step is the encoded string.
2.2.4 Prepend "bq--" to the encoded string and finish 2.2.5 Prepend "lq--" to the encoded string and finish
Prepend the characters "bq--" to the encoded string. This is the host Prepend the characters "lq--" to the encoded string. This is the host
name part that can be used in DNS resolution. name part that can be used in DNS resolution.
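The conversion steps above (as numbered in the -01 text, 2.2.1 through 2.2.5) can be sketched as one function. This is an illustration under assumptions: toAcePart is a hypothetical name, and the compression and Base32 routines are passed in as parameters rather than defined here, since they are specified in sections 2.4 and 2.5.

```javascript
// Host part rules as summarized in section 2 (ASCII only).
const STD13_PART = /^[A-Za-z0-9]([A-Za-z0-9-]*[A-Za-z0-9])?$/;

// Hypothetical driver. utf16be is an array of octets in big-endian
// UTF-16; compress and base32Encode are caller-supplied functions
// standing in for sections 2.4 and 2.5.
function toAcePart(utf16be, compress, base32Encode) {
  // 2.2.1: a name part that needs no LACE MUST NOT be encoded.
  let text = "";
  for (let i = 0; i + 1 < utf16be.length; i += 2) {
    text += String.fromCharCode((utf16be[i] << 8) | utf16be[i + 1]);
  }
  if (STD13_PART.test(text)) {
    throw new Error("name part can be represented without LACE");
  }
  // 2.2.2: compress the pre-converted string.
  const compressed = compress(utf16be);
  // 2.2.3: the compressed string must be 36 octets or shorter.
  if (compressed.length > 36) {
    throw new Error("compressed string is longer than 36 octets");
  }
  // 2.2.4 and 2.2.5: Base32-encode, then prepend the tag.
  return "lq--" + base32Encode(compressed);
}
```

With the four-octet tag, 36 compressed octets encode to 58 Base32 characters, giving 62 octets in total, which stays within the 63-octet limit on name parts.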
2.3 Converting a host name part to an internationalized name 2.3 Converting a host name part to an internationalized name
The input string for conversion is a valid host name part. Note that if The input string for conversion is a valid host name part. Note that if
any checking for prohibited name parts (such as prohibited characters, any checking for prohibited name parts (such as prohibited characters,
case-folding, or canonicalization) is to be done, it MUST be done after case-folding, or canonicalization) is to be done, it MUST be done after
doing the conversion from an ACE name part. doing the conversion from an ACE name part.
If a decoded name part consists exclusively of characters that conform If a decoded name part consists exclusively of characters that conform
to the host name requirements in [STD13], the conversion from LACE MUST to the host name requirements in [STD13], the conversion from LACE MUST
fail. Because a name part that can be represented without LACE MUST NOT fail. Because a name part that can be represented without LACE MUST NOT
be encoded using LACE, the decoding process MUST check for name parts be encoded using LACE, the decoding process MUST check for name parts
that consist exclusively of characters that conform to the host name that consist exclusively of characters that conform to the host name
requirements in [STD13] and, if such a name part is found, MUST requirements in [STD13] and, if such a name part is found, MUST
be considered an error (and possibly a security violation). be considered an error (and possibly a security violation).
2.3.1 Strip the "bq--" 2.3.1 Strip the "lq--"
The input string MUST begin with the characters "bq--". If it does not, The input string MUST begin with the characters "lq--". If it does not,
the conversion MUST stop with an error. Otherwise, remove the characters the conversion MUST stop with an error. Otherwise, remove the characters
"bq--" from the input string. The result of this step is the stripped "lq--" from the input string. The result of this step is the stripped
string. string.
2.3.2 Decode the stripped string with Base32 2.3.2 Decode the stripped string with Base32
The entire stripped string MUST be checked to see if it is valid Base32 The entire stripped string MUST be checked to see if it is valid Base32
output. The entire stripped string MUST be changed to all lower-case output. The entire stripped string MUST be changed to all lower-case
letters and digits. If any resulting characters are not in Table 1, the letters and digits. If any resulting characters are not in Table 1, the
conversion MUST stop with an error; the input string is the conversion MUST stop with an error; the input string is the
post-converted string. Otherwise, the entire resulting string MUST be post-converted string. Otherwise, the entire resulting string MUST be
converted to a binary format using the Base32 decoding described in converted to a binary format using the Base32 decoding described in
section 2.5. The result of this step is the decoded string. section 2.5. The result of this step is the decoded string.
2.3.3 Decompress the decoded string 2.3.3 Decompress the decoded string
The entire decoded string MUST be converted to ISO 10646 characters The entire decoded string MUST be converted to ISO 10646 characters
using the decompression algorithm described in section 2.4. The result using the decompression algorithm described in section 2.4. The result
of this is the internationalized string. of this is the internationalized string.
2.3.4 Check the internationalized string for disallowed names
If the internationalized string consists only of characters that conform
to the host name requirements in [STD13], the conversion MUST stop with
an error.
2.4 Compression algorithm 2.4 Compression algorithm
The basic method for compression is to reduce a substring that consists The basic method for compression is to reduce a substring that consists
of characters all from a single row of the ISO 10646 repertoire to a of characters all from a single row of the ISO 10646 repertoire to a
count octet followed by the row header followed by the lower octets of count octet followed by the row header followed by the lower octets of
the characters. If this ends up being longer than the input, the string the characters. If this ends up being longer than the input, the string
is not compressed, but instead has a unique one-octet header attached. is not compressed, but instead has a unique one-octet header attached.
Although the uncompressed mode limits the number of characters in a LACE Although the uncompressed mode limits the number of characters in a LACE
name part to 17, this is still generally enough for almost all names in name part to 17, this is still generally enough for all names in almost
almost all scripts. Also, this limit is close to the limits set by other all scripts. Also, this limit is close to the limits set by other encoding
encoding proposals. proposals.
Note that the compression and decompression rules MUST be followed Note that the compression and decompression rules MUST be followed
exactly. This requirement prevents a single host name part from having exactly. This requirement prevents a single host name part from having
two encodings. Thus, for any input to the algorithm, there is only one two encodings. Thus, for any input to the algorithm, there is only one
possible output. An implementation cannot choose to use one-octet mode or possible output. An implementation cannot choose to use one-octet mode or
two-octet mode using anything other than the logic given in this two-octet mode using anything other than the logic given in this
section. section.
2.4.1 Compressing a string 2.4.1 Compressing a string
The input string is in big-endian UTF-16 encoding with no byte order The input string is in the UTF-16 encoding (big-endian UTF-16 with no
mark. byte order mark).
Design note: No checking is done on the input to this algorithm. It is Design note: No checking is done on the input to this algorithm. It is
assumed that all checking for valid ISO/IEC 10646 characters has already assumed that all checking for valid ISO/IEC 10646 characters has already
been done by a previous step in the conversion process. been done by a previous step in the conversion process.
1) If the length of the input is not even, or is less than 2, stop with 1) If the length (measured in octets) of the input is not even, or is
an error. less than 2, stop with an error.
2) Set the input pointer, called IP, to the first octet of the input 2) Set the input pointer, called IP, to the first octet of the input
string. string.
3) Set the variable called HIGH to the octet at IP. 3) Set the variable called HIGH to the octet at IP.
4) Determine the number of pairs at or after IP that have HIGH as the 4) Determine the number of contiguous pairs at or after IP that have
first octet; call this COUNT. HIGH as the first octet; call this COUNT.
5) Put into an output buffer the single octet for COUNT followed by the 5) Put into an output buffer the single octet for COUNT followed by the
single octet for HIGH, followed by all those low octets. Move IP to the single octet for HIGH, followed by all those low octets. Move IP to the
end of those pairs; that is, set IP to IP+(2*(COUNT+1)). end of those pairs; that is, set IP to IP+(2*COUNT).
6) If IP is not at the end of the input string, go to step 3. 6) If IP is not at the end of the input string, go to step 3.
7) If the length of the output buffer is less than or equal to the 7) If the length of the output buffer is less than or equal to the
length of the input buffer (in octets, not in characters), output the length of the input buffer (in octets, not in characters), emit the
buffer. Otherwise, output the octet 0xFF followed by the input buffer. output buffer. Otherwise, output the octet 0xFF followed by the input
Note that there can only be one possible representation for a name part, buffer. Note that there can only be one possible representation for a
so that outputting the wrong name part is a serious security error. name part, so that outputting the wrong name part is a serious security
Decompression schemes MUST accept only the valid form and MUST NOT error. Decompression schemes MUST accept only the valid form and MUST
accept invalid forms. NOT accept invalid forms.
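The compression steps can be sketched as follows. This follows the -01 wording (contiguous pairs, IP advanced by 2*COUNT); the function name laceCompress is hypothetical, and the sketch assumes the input is an array of octets short enough that COUNT fits in one octet.

```javascript
// Sketch of section 2.4.1. Input: array of octets in big-endian
// UTF-16 with no byte order mark. Returns an array of octets.
function laceCompress(input) {
  // Step 1: length in octets must be even and at least 2.
  if (input.length % 2 !== 0 || input.length < 2) {
    throw new Error("input length must be even and at least 2 octets");
  }
  const out = [];
  let ip = 0; // step 2
  while (ip < input.length) { // step 6 loops back to step 3
    const high = input[ip]; // step 3
    // Step 4: count contiguous pairs whose first octet is HIGH.
    let count = 0;
    while (ip + 2 * count < input.length && input[ip + 2 * count] === high) {
      count++;
    }
    // Step 5: COUNT, HIGH, then the low octets of the run.
    out.push(count, high);
    for (let k = 0; k < count; k++) {
      out.push(input[ip + 2 * k + 1]);
    }
    ip += 2 * count;
  }
  // Step 7: use the run-length form only if it is not longer than
  // the input; otherwise emit 0xFF followed by the input unchanged.
  return out.length <= input.length ? out : [0xFF].concat(input);
}
```

Because the choice between the two forms depends only on the two lengths, every input has exactly one output, which is the uniqueness property the section requires.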
2.4.2 Decompressing a string 2.4.2 Decompressing a string
1. Set the input pointer, called IP, to the first octet of the input 1. Set the input pointer, called IP, to the first octet of the input
string. If there is no first octet, stop with an error. string. If there is no first octet, stop with an error.
2. If the octet at IP is 0xFF, go to step 10. 2. If the octet at IP is 0xFF, set IP to IP+1, copy the rest of the
input buffer to the output buffer, and go to step 9.
3. Get the octet at IP, call it COUNT. Set IP to IP+1. If IP is now at 3. Get the octet at IP, call it COUNT. If COUNT equals zero or is
the end of the input string, stop with an error. greater than 36, stop with an error. Set IP to IP+1. If IP is now at the
end of the input string, stop with an error.
4. Get the octet at IP, call it HIGH. Set IP to IP+1. If IP is now at 4. Get the octet at IP, call it HIGH. Set IP to IP+1.
the end of the input string, stop with an error.
5. Get the octet at IP, call it LOW. Set IP to IP+1. 5. If IP is now at the end of the input string, stop with an error. Get
the octet at IP, call it LOW. Set IP to IP+1.
6. Output HIGH, then LOW, to the output buffer. 6. Output HIGH, then LOW, to the output buffer.
7. Decrement COUNT. If COUNT is greater than 0, go to step 5. 7. Decrement COUNT. If COUNT is greater than 0, go to step 5.
8. If IP is not at the end of the input buffer, go to step 3. 8. If IP is not at the end of the input buffer, go to step 3.
9. Compare the length of the input string with the length of the output 9. If the length of the output buffer is odd, stop with an error.
buffer. If the length of the output buffer is longer than the length of Compress the output buffer into a separate comparison buffer following
the input buffer, stop with an error because the wrong compression form the steps for compression above. If the contents of the comparison
was used. Otherwise, send out the output buffer and stop. buffer do not equal the input to the decompression step, stop with an
error. Otherwise, send out the output buffer and stop.
10. Set IP to IP+1. Copy the rest of the input buffer to the output
buffer. Compress the output buffer into a separate comparison buffer
following the steps for compression above. If the length of the
comparison buffer is less than or equal to the length of the output
buffer, stop with an error because the wrong compression form was used.
Otherwise, send out the output buffer and stop.
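The decompression steps of the -01 text can be sketched as follows. The sketch is self-contained, so it repeats a compact compression routine; both function names are hypothetical. The final check is read here as comparing the recompressed output against the octets that were given to the decompressor, which is an interpretation of step 9 rather than a quotation of it.

```javascript
// Compact form of section 2.4.1, needed for the step 9 check.
function laceCompress(input) {
  if (input.length % 2 !== 0 || input.length < 2) {
    throw new Error("input length must be even and at least 2 octets");
  }
  const out = [];
  let ip = 0;
  while (ip < input.length) {
    const high = input[ip];
    let count = 0;
    while (ip + 2 * count < input.length && input[ip + 2 * count] === high) {
      count++;
    }
    out.push(count, high);
    for (let k = 0; k < count; k++) out.push(input[ip + 2 * k + 1]);
    ip += 2 * count;
  }
  return out.length <= input.length ? out : [0xFF].concat(input);
}

// Sketch of section 2.4.2 (-01 steps). Input and output are arrays of octets.
function laceDecompress(input) {
  if (input.length === 0) throw new Error("empty input"); // step 1
  let output = [];
  if (input[0] === 0xFF) {
    output = input.slice(1); // step 2: uncompressed form
  } else {
    let ip = 0;
    while (ip < input.length) { // step 8
      let count = input[ip]; // step 3
      if (count === 0 || count > 36) throw new Error("bad COUNT");
      ip++;
      if (ip >= input.length) throw new Error("truncated after COUNT");
      const high = input[ip++]; // step 4
      while (count > 0) { // steps 5 to 7
        if (ip >= input.length) throw new Error("truncated run");
        output.push(high, input[ip++]);
        count--;
      }
    }
  }
  // Step 9: only the unique valid form is accepted.
  if (output.length % 2 !== 0) throw new Error("odd output length");
  const check = laceCompress(output);
  if (check.length !== input.length || check.some((v, i) => v !== input[i])) {
    throw new Error("wrong compression form was used");
  }
  return output;
}
```

Recompressing and comparing rejects both kinds of invalid input: a run-length form where the 0xFF form was required, and a 0xFF form where the run-length form was required.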
2.4.3 Compression examples 2.4.3 Compression examples
The five input characters <U+30E6 U+30CB U+30B3 U+30FC U+30C9> are The five input characters <U+30E6 U+30CB U+30B3 U+30FC U+30C9> are
represented in big-endian UTF-16 as the ten octets <30 E6 30 CB 30 B3 30 represented in big-endian UTF-16 as the ten octets <30 E6 30 CB 30 B3 30
FC 30 C9>. All the code units are in the same row (30). The output FC 30 C9>. All the code units are in the same row (30). The output
buffer has seven octets <05 30 E6 CB B3 FC C9>, which is shorter than buffer has seven octets <05 30 E6 CB B3 FC C9>, which is shorter than
the input string. Thus the output is <05 30 E6 CB B3 FC C9>. the input string. Thus the output is <05 30 E6 CB B3 FC C9>.
The four input characters <U+012E U+0110 U+014A U+00C5> are represented The four input characters <U+012F U+0111 U+0149 U+00E5> are represented
in big-endian UTF-16 as the eight octets <01 2E 01 10 01 4A 00 C5>. The in big-endian UTF-16 as the eight octets <01 2F 01 11 01 49 00 E5>. The
output buffer has eight octets <03 01 2E 10 4A 01 00 C5>, which is the output buffer has eight octets <03 01 2F 11 49 01 00 E5>, which is the
same length as the input string. Thus, the output is <03 01 2E 10 4A 01 same length as the input string. Thus, the output is <03 01 2F 11 49 01
00 C5>. 00 E5>.
The three input characters <U+012E U+00D0 U+014A> are represented in The three input characters <U+012F U+00E0 U+014B> are represented in
big-endian UTF-16 as the six octets <01 2E 00 D0 01 4A>. The output big-endian UTF-16 as the six octets <01 2F 00 E0 01 4B>. The output
buffer is nine octets <01 01 2E 01 00 D0 01 01 4A>, which is longer than buffer is nine octets <01 01 2F 01 00 E0 01 01 4B>, which is longer than
the input buffer. Thus, the output is <FF 01 2E 00 D0 01 4A>. the input buffer. Thus, the output is <FF 01 2F 00 E0 01 4B>.
2.5 Base32 2.5 Base32
In order to encode non-ASCII characters in DNS-compatible host name parts, In order to encode non-ASCII characters in DNS-compatible host name parts,
they must be converted into legal characters. This is done with Base32 they must be converted into legal characters. This is done with Base32
encoding, described here. encoding, described here.
Table 1 shows the mapping between input bits and output characters in Table 1 shows the mapping between input bits and output characters in
Base32. Design note: the digits used in Base32 are "2" through "7" Base32. Design note: the digits used in Base32 are "2" through "7"
instead of "0" through "5" in order to avoid digits "0" and "1". This instead of "0" through "5" in order to avoid digits "0" and "1". This
skipping to change at line 416 skipping to change at line 425
6) Look up the value of the set of five bits in the bits column of 6) Look up the value of the set of five bits in the bits column of
Table 1, and output the character from the char column (whose hex value Table 1, and output the character from the char column (whose hex value
is in the hex column). is in the hex column).
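Base32 encoding can be sketched as follows. Table 1 is not reproduced in this part of the text, so the alphabet below is an assumption: "a" through "z" for the values 0 to 25, then "2" through "7" for 26 to 31, consistent with the design note above. The names BASE32_CHARS and base32Encode are hypothetical.

```javascript
// Assumed Table 1 mapping: index = five-bit value, character = output.
const BASE32_CHARS = "abcdefghijklmnopqrstuvwxyz234567";

// Sketch of section 2.5.1. Input: array of octets. Output: string.
function base32Encode(octets) {
  let out = "";
  let acc = 0;  // bits read but not yet emitted
  let bits = 0; // number of valid bits in acc
  for (const octet of octets) {
    acc = (acc << 8) | octet;
    bits += 8;
    // Emit as many complete five-bit groups as are available.
    while (bits >= 5) {
      bits -= 5;
      out += BASE32_CHARS[(acc >> bits) & 0x1f];
    }
    acc &= (1 << bits) - 1; // keep only the leftover bits
  }
  // Pad the final partial group with zero bits on the right.
  if (bits > 0) {
    out += BASE32_CHARS[(acc << (5 - bits)) & 0x1f];
  }
  return out;
}
```

Under the assumed alphabet, the four octets 0x3a270f93 used in the example of section 2.5.3 split into the five-bit values 7, 8, 19, 16, 31, 4 and a final 24 (two data bits plus three zero bits), so base32Encode([0x3a, 0x27, 0x0f, 0x93]) returns "hitq7ey".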
2.5.2 Decoding Base32 as octets 2.5.2 Decoding Base32 as octets
The input is octets in network byte order. The input octets MUST be The input is octets in network byte order. The input octets MUST be
values from the second column in Table 1. values from the second column in Table 1.
1) Set the read pointer to the beginning of the input octet stream. 1) Count the number of octets in the input and divide it by 8; call the
remainder INPUTCHECK. If INPUTCHECK is 1 or 3 or 6, stop with an error.
2) Look up the character value of the octet in the char column (or hex 2) Set the read pointer to the beginning of the input octet stream.
value in hex column) of Table 1, and output the five bits from the bits
column.
3) Move the read pointer one octet forward. If the read pointer is at 3) Look up the character value of the octet in the char column (or hex
the end of the input octet stream (that is, there are no more octets in value in hex column) of Table 1, and add the five bits from the bits
the input), stop. Otherwise, go to step 2. column to the output buffer.
4) Move the read pointer one octet forward. If the read pointer is not
at the end of the input octet stream (that is, there are more octets in
the input), go to step 3.
5) Count the number of bits that are in the output buffer and divide it
by 8; call the remainder PADDING. If the PADDING number of bits at the
end of the output buffer are not all zero, stop with an error.
Otherwise, emit the output buffer and stop.
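The decoding steps of the -01 text can be sketched as follows. As with encoding, the alphabet is an assumption because Table 1 is not reproduced in this part of the text, and the input is assumed to have been lower-cased already (section 2.3.2). The names BASE32_CHARS and base32Decode are hypothetical.

```javascript
// Assumed Table 1 mapping: index = five-bit value, character = input.
const BASE32_CHARS = "abcdefghijklmnopqrstuvwxyz234567";

// Sketch of section 2.5.2. Input: lower-case string. Output: array of octets.
function base32Decode(text) {
  // Step 1: lengths with remainder 1, 3, or 6 cannot come from the encoder.
  const inputCheck = text.length % 8;
  if (inputCheck === 1 || inputCheck === 3 || inputCheck === 6) {
    throw new Error("impossible Base32 length");
  }
  const out = [];
  let acc = 0;  // bits read but not yet emitted
  let bits = 0; // number of valid bits in acc
  for (const ch of text) { // steps 2 to 4
    const value = BASE32_CHARS.indexOf(ch);
    if (value < 0) throw new Error("character not in Table 1");
    acc = (acc << 5) | value;
    bits += 5;
    if (bits >= 8) {
      bits -= 8;
      out.push((acc >> bits) & 0xff);
      acc &= (1 << bits) - 1; // keep only the leftover bits
    }
  }
  // Step 5: the leftover PADDING bits must all be zero.
  if (acc !== 0) throw new Error("non-zero padding bits");
  return out;
}
```

The length test and the padding test together ensure that only strings the encoder could have produced are accepted, so each octet string has a single Base32 form.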
2.5.3 Base32 example 2.5.3 Base32 example
Assume you want to encode the value 0x3a270f93. The bit string is: Assume you want to encode the value 0x3a270f93. The bit string is:
3 a 2 7 0 f 9 3 3 a 2 7 0 f 9 3
00111010 00100111 00001111 10010011 00111010 00100111 00001111 10010011
Broken into chunks of five bits, this is: Broken into chunks of five bits, this is:
skipping to change at line 484 skipping to change at line 501
technology -- Universal Multiple-Octet Coded Character Set (UCS) -- technology -- Universal Multiple-Octet Coded Character Set (UCS) --
Part 1: Architecture and Basic Multilingual Plane. Five amendments and Part 1: Architecture and Basic Multilingual Plane. Five amendments and
a technical corrigendum have been published up to now. UTF-16 is a technical corrigendum have been published up to now. UTF-16 is
described in Annex Q, published as Amendment 1. 17 other amendments are described in Annex Q, published as Amendment 1. 17 other amendments are
currently at various stages of standardization. [[[ THIS REFERENCE currently at various stages of standardization. [[[ THIS REFERENCE
NEEDS TO BE UPDATED AFTER DETERMINING ACCEPTABLE WORDING ]]] NEEDS TO BE UPDATED AFTER DETERMINING ACCEPTABLE WORDING ]]]
[RFC2119] Scott Bradner, "Key words for use in RFCs to Indicate [RFC2119] Scott Bradner, "Key words for use in RFCs to Indicate
Requirement Levels", March 1997, RFC 2119. Requirement Levels", March 1997, RFC 2119.
[RFC2781] Paul Hoffman and Francois Yergeau, "UTF-16, an encoding of ISO
10646", February 2000, RFC 2781.
[STD13] Paul Mockapetris, "Domain names - implementation and [STD13] Paul Mockapetris, "Domain names - implementation and
specification", November 1987, STD 13 (RFC 1035). specification", November 1987, STD 13 (RFC 1035).
[Unicode3] The Unicode Consortium, "The Unicode Standard -- Version [Unicode3] The Unicode Consortium, "The Unicode Standard -- Version
3.0", ISBN 0-201-61633-5. Described at 3.0", ISBN 0-201-61633-5. Described at
<http://www.unicode.org/unicode/standard/versions/Unicode3.0.html>. <http://www.unicode.org/unicode/standard/versions/Unicode3.0.html>.
A. Acknowledgements A. Acknowledgements
Rick Wesson pointed out some error conditions that need to be
tested for. Scott Hollenbeck pointed out some errors in the
compression.
Base32 is quite obviously inspired by the tried-and-true Base64 Base32 is quite obviously inspired by the tried-and-true Base64
Content-Transfer-Encoding from MIME. Content-Transfer-Encoding from MIME.
B. IANA Considerations B. Sample code
There are no IANA considerations in this document. The following is sample Javascript code for the LACE algorithm.
This code is believed to be correct, but there may be errors in
it. The code is provided as-is and comes with no warranty of
fitness, correctness, blah blah blah.
C. Author Contact Information
/**
* Converts to LACE compression format (without Base32) from
* UTF-16BE array
* @parameter iArray Array of bytes in UTF-16BE
* @parameter iCount Number of elements. Must be even and in 2..62
* @parameter oArray Array for output of LACE bytes.
* Must be at least 100 octets long to provide internal working space
* @return Length of output array used
* @parameter parseResult output error value if any
* @author Mark Davis
*/
function toLACE(iArray, iCount, oArray, parseResult) {
//debugger;
if (iCount < 1 || iCount > 62) {
parseResult.set("Lace: count out of range", iCount);
return;
}
if ((iCount % 2) == 1) {
parseResult.set("Lace: odd length, can't be UTF-16", iCount);
return;
}
var op = 0; // output index
var ip = 0; // input index
var lastHigh = -1;
var lenp = 0;
while (ip < iCount) {
var high = iArray[ip++];
if (high != lastHigh) {
if (lastHigh != -1) { // store last length
var len = op - lenp - 2;
oArray[lenp] = len;
}
lenp = op++; // reserve space
oArray[op++] = high;
lastHigh = high;
}
oArray[op++] = iArray[ip++];
}
// store last len
var len = op - lenp - 2;
oArray[lenp] = len;
// see if the input is short, and we should
// just copy
if (op > iCount) {
if (op > 63) {
parseResult.set("Lace: output too long", op);
return;
}
oArray[0] = 0xFF;
copyTo(iArray, 0, iCount, oArray, 1);
op = iCount + 1;
}
return op;
}
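The compression step in toLACE can be restated more compactly. The
following is only a sketch, not the draft's code: the name
laceCompressSketch is hypothetical, it returns a new array rather than
filling a caller-supplied buffer, and it leaves out the length-limit
checks. It assumes the input is a non-empty, even-length array of
UTF-16BE octets.

```javascript
// Sketch of LACE run-length compression (no Base32 step).
// Each run is emitted as: [count] [shared high octet] [low octets...].
// laceCompressSketch is a hypothetical name, not part of the draft.
function laceCompressSketch(bytes) {
  if (bytes.length === 0 || bytes.length % 2 !== 0) {
    throw new Error("input must be a non-empty, even-length UTF-16BE array");
  }
  const out = [];
  let lastHigh = -1;
  let countPos = -1; // index in out of the current run's count octet
  for (let i = 0; i < bytes.length; i += 2) {
    const high = bytes[i];
    const low = bytes[i + 1];
    if (high !== lastHigh) {
      countPos = out.length;
      out.push(0, high); // placeholder count, then the shared high octet
      lastHigh = high;
    }
    out.push(low);
    out[countPos] += 1; // one more low octet in this run
  }
  // If the run-length form is longer than the input, fall back to
  // 0xFF followed by the unmodified input, as toLACE does.
  if (out.length > bytes.length) {
    return [0xFF].concat(bytes);
  }
  return out;
}
```

For the two-character input "ab" (octets 00 61 00 62) the sketch yields
02 00 61 62. An input whose high octet changes on every character, such
as 30 42 00 61, falls back to the FF form.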
/**
* Converts from LACE compressed format (without Base32) to
* UTF-16BE array
* @parameter iArray Array of bytes in LACE format
* @parameter iCount Number of elements
* @parameter oArray Array for output of bytes, UTF-16BE.
* Must be at least iCount+1 long
* @return Length of output array used
* @parameter parseResult output error value if any
* @author Mark Davis
*/
function fromLACE(iArray, iCount, oArray, parseResult) {
var high;
if (iCount < 1 || iCount > 63) {
parseResult.set("fromLACE: count out of range", iCount);
return;
}
var op = 0;
var ip = 0;
var result = 0;
if (iArray[ip] == 0xFF) { // special case FF
copyTo(iArray, 1, iCount-1, oArray, 0);
result = iCount-1;
} else {
while (ip < iCount) { // loop over runs
var count = iArray[ip++];
if (ip == iCount) {
parseResult.set("fromLACE: truncated before high", ip);
return;
}
high = iArray[ip++];
for (var i = 0; i < count; ++i) {
oArray[op++] = high;
if (ip == iCount) {
parseResult.set("fromLACE: truncated from count", ip);
return;
}
oArray[op++] = iArray[ip++];
}
}
result = op;
}
// check for uniqueness
var checkArray = [];
var checkCount = toLACE(oArray, result, checkArray, parseResult);
if (!equals(iArray, iCount, checkArray, checkCount)) {
parseResult.set("fromLACE: illegal input form");
return;
}
return result;
}
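The expansion loop in fromLACE can be sketched in the same way. This
is an illustration under assumptions, not the draft's code: the name
laceDecompressSketch is hypothetical, errors are reported by throwing
rather than through parseResult, and the final re-encoding check for
uniqueness is omitted.

```javascript
// Sketch of LACE expansion (input already Base32-decoded).
// laceDecompressSketch is a hypothetical name, not part of the draft.
function laceDecompressSketch(lace) {
  if (lace.length === 0) {
    throw new Error("empty input");
  }
  if (lace[0] === 0xFF) {
    return lace.slice(1); // FF form: the rest is the uncompressed input
  }
  const out = [];
  let ip = 0;
  while (ip < lace.length) { // loop over runs
    const count = lace[ip++];
    if (ip >= lace.length) {
      throw new Error("truncated before high octet");
    }
    const high = lace[ip++];
    for (let i = 0; i < count; ++i) {
      if (ip >= lace.length) {
        throw new Error("truncated inside run");
      }
      out.push(high, lace[ip++]); // rebuild one UTF-16BE code unit
    }
  }
  return out;
}
```

A complete decoder would also re-encode the result and compare it with
the input, as fromLACE does, so that each name has exactly one legal
encoded form.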
/**
* Utility routine for comparing arrays
* @parameter array1 first array to compare
* @parameter count1 number of elements to compare in first array
* @parameter array2 second array to compare
* @parameter count2 number of elements to compare in second array
* @return true iff counts are same, and elements from 0 to count-1
* are the same
*/
function equals(array1, count1, array2, count2) {
if (count1 != count2) return false;
for (var i = 0; i < count1; ++i) {
if (array1[i] != array2[i]) return false;
}
return true;
}
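The sample code calls copyTo and parseResult.set and reads the flags
baseDebugTo and baseDebugFrom, none of which are defined in this
appendix. The following stand-ins are assumptions inferred only from
how those names are called here; makeParseResult and the field names
are hypothetical.

```javascript
// Hypothetical helpers. The names copyTo, parseResult.set, baseDebugTo
// and baseDebugFrom come from the sample code; the bodies below are
// assumptions, not part of the draft.

// Copy count elements from source[sourceStart...] to target[targetStart...].
function copyTo(source, sourceStart, count, target, targetStart) {
  for (var i = 0; i < count; ++i) {
    target[targetStart + i] = source[sourceStart + i];
  }
}

// Build an object that records the most recent error passed to set().
function makeParseResult() {
  return {
    message: null,
    value: null,
    position: null,
    set: function (message, value, position) {
      this.message = message;
      this.value = (value === undefined) ? null : value;
      this.position = (position === undefined) ? null : position;
    },
    isError: function () {
      return this.message !== null;
    }
  };
}

// Debug switches read by toBase32 and fromBase32; off by default.
var baseDebugTo = false;
var baseDebugFrom = false;
```

A caller would create one result object per conversion, pass it as the
parseResult argument, and test isError() afterwards, since the sample
functions return undefined on failure.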
/**
* Utility routine for getting array of bytes from UTF-16 string
* @parameter str source string
* @parameter oArray output array to fill in
* @return count of bytes put into oArray
*/
function utf16FromString(str, oArray) {
var op = 0;
for (var i = 0; i < str.length; ++i) {
var code = str.charCodeAt(i);
oArray[op++] = (code >>> 8); // top byte
oArray[op++] = (code & 0xFF); // bottom byte
}
return op;
}
/**
* Utility routine to see if string doesn't need LACE
* @parameter str source string
* @return true if ok already
*/
function okAlready(str) {
for (var i = 0; i < str.length; ++i) {
var c = str.charAt(i);
if (c == '-' || 'a' <= c && c <= 'z' || '0' <= c && c <= '9')
continue;
return false;
}
return true;
}
/**
* Convert from bytes to base32
* @parameter input Input buffer of bytes with values 00 to FF
* @parameter inputLength Length of input buffer
* @parameter output Output buffer, to be filled with values from
* a-z2-7.
* Must be of at least length input*8/5 + 1
* @return Length of output buffer used
* @author Mark Davis
*/
function toBase32(input, inputLength, output, parseResult) {
//debugger;
var bits = 0;
var bitCount = 0;
var ip = 0;
var op = 0;
var val = 0;
while (true) {
// get bits if we don't have enough
if (bitCount < 5) {
if (ip >= inputLength) break;
// get another input
bits <<= 8;
if (baseDebugTo) alert("byte: " + input[ip].toString(16) +
", bitCount: " + (bitCount+8));
bits = bits | input[ip++];
bitCount += 8;
}
// emit and remove them
bitCount -= 5;
val = (bits >> bitCount);
if (baseDebugTo) alert("Val: " + val.toString(16) + ", bitCount: "
+ bitCount);
output[op++] = toLetter(val);
//if (baseDebugTo) alert("out: " + output[op-1].toString(16));
bits &= ~(0x1F << bitCount);
}
// add padding and output if necessary
if (bitCount > 0) {
if (baseDebugTo) alert("bits*: " + bits.toString(16) +
", bitCount: " + bitCount);
val = bits << (5 - bitCount);
if (baseDebugTo) alert("out*: " + val.toString(16));
output[op++] = toLetter(val);
}
return op;
}
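To tie the encoder back to the example in section 2.5.3, the following
sketch encodes the four octets of 0x3a270f93. It is a simplified
restatement, not the draft's code: the name base32EncodeSketch and the
alphabet constant are hypothetical, and it returns a string rather
than filling an output buffer.

```javascript
// Sketch of the Base32 encoding step using the a-z, 2-7 alphabet.
// base32EncodeSketch and BASE32_ALPHABET are hypothetical names.
const BASE32_ALPHABET = "abcdefghijklmnopqrstuvwxyz234567";

function base32EncodeSketch(bytes) {
  let bits = 0;     // pending bits not yet emitted
  let bitCount = 0; // how many pending bits are held in bits
  let out = "";
  for (const b of bytes) {
    bits = (bits << 8) | b;
    bitCount += 8;
    while (bitCount >= 5) {
      bitCount -= 5;
      out += BASE32_ALPHABET[(bits >> bitCount) & 0x1F];
      bits &= (1 << bitCount) - 1; // drop the bits just emitted
    }
  }
  if (bitCount > 0) {
    // pad the last partial chunk with zero bits on the right
    out += BASE32_ALPHABET[(bits << (5 - bitCount)) & 0x1F];
  }
  return out;
}

console.log(base32EncodeSketch([0x3a, 0x27, 0x0f, 0x93])); // prints "hitq7ey"
```

The 32 input bits split into six full five-bit chunks plus two
leftover bits, which are padded with three zero bits to form the
seventh character.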
/**
* Convert from base32 to bytes
* @parameter input Input buffer of bytes with values from a-z2-7
* @parameter inputLength Length of input buffer
* @parameter output Output buffer, to be filled with bytes from
* 00 to FF
* Must be of at least length input*5/8 + 1
* @return Length of output buffer used
* @author Mark Davis
*/
function fromBase32(input, inputLength, output, parseResult) {
//debugger;
var inputCheck = inputLength % 8;
if (inputCheck == 1 || inputCheck == 3 || inputCheck == 6) {
parseResult.set("Base32 excess length", null, inputLength);
return;
}
var bits = 0;
var bitCount = 0;
var ip = 0;
var op = 0;
var val = 0;
while (ip < inputLength) {
// get more bits
var val = input[ip++];
val = fromLetter(val);
if (val < 0 || val > 0x1F) { // a Base32 digit carries only 5 bits
parseResult.set("Bad Base32 byte", val, ip-1);
return;
}
if (baseDebugFrom) alert("base32: " + val.toString(16));
bits <<= 5;
bits = bits | val;
bitCount += 5;
if (baseDebugFrom) alert("from: " + val.toString(16) +
", bitCount: " + bitCount);
// emit & remove if we can
if (bitCount >= 8) {
bitCount -= 8;
output[op++] = bits >> bitCount;
if (baseDebugFrom) alert("out2: " + (bits >> bitCount) +
", bitCount: " + bitCount);
bits &= ~(0xFF << bitCount);
}
}
// check that padding is with zero!
if (bits != 0) return -ip;
return op;
}
function toLetter(val) {
if (val > 25) return val - 26 + 0x32;
return val + 0x61;
// return val + (val < 26 ? 0x61 : 0x18);
}
function fromLetter(val) {
if (val < 0x61) return val + 26 - 0x32;
return val - 0x61;
}
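The decoding direction, including the three checks made by fromBase32
(impossible input length, characters outside a-z2-7, and non-zero
padding bits), can be sketched as follows. This is an illustration
under assumptions, not the draft's code: base32DecodeSketch and
BASE32_DIGITS are hypothetical names, and errors are thrown rather
than reported through parseResult.

```javascript
// Sketch of the Base32 decoding step using the a-z, 2-7 alphabet.
// base32DecodeSketch and BASE32_DIGITS are hypothetical names.
const BASE32_DIGITS = "abcdefghijklmnopqrstuvwxyz234567";

function base32DecodeSketch(text) {
  // Lengths of 1, 3 or 6 modulo 8 cannot result from whole octets.
  const rem = text.length % 8;
  if (rem === 1 || rem === 3 || rem === 6) {
    throw new Error("impossible Base32 length");
  }
  let bits = 0;
  let bitCount = 0;
  const out = [];
  for (const ch of text) {
    const val = BASE32_DIGITS.indexOf(ch);
    if (val < 0) {
      throw new Error("bad Base32 character: " + ch);
    }
    bits = (bits << 5) | val;
    bitCount += 5;
    if (bitCount >= 8) {
      bitCount -= 8;
      out.push((bits >> bitCount) & 0xFF);
      bits &= (1 << bitCount) - 1; // keep only the bits not yet emitted
    }
  }
  if (bits !== 0) {
    throw new Error("non-zero padding bits");
  }
  return out;
}
```

Looking characters up in the alphabet string rejects every character
outside a-z2-7, including upper-case letters, which is stricter than
an arithmetic mapping followed by a range check.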
C. Differences between -00 and -01
1: Minor typos.
2.1: Changed the tag to 'lq--'.
2.2 and 2.3: Added check for all-STD13 names in the steps.
2.4.1: Clarified first sentence. Step 5: fixed the moving of the IP.
2.4.2: Moved the last sentence of step 4 to be the first sentence of
step 5. Added the check for odd-length output. Changed the exit
comparison to doing a full comparison (instead of looking for lengths).
2.5.2: Changed the sense of the test in step 3 and added step 4 to check
for malformed input. Also made the output a buffer. Also added new step
1.
Changed Appendix B from IANA Considerations (of which there are none) to
Javascript code sample.
D. Author Contact Information
Mark Davis Mark Davis
IBM IBM
10275 N. De Anza Blvd 10275 N. De Anza Blvd
Cupertino, CA 95014 Cupertino, CA 95014
mark.davis@us.ibm.com and mark.davis@macchiato.com mark.davis@us.ibm.com and mark.davis@macchiato.com
Paul Hoffman Paul Hoffman
Internet Mail Consortium and VPN Consortium Internet Mail Consortium and VPN Consortium
127 Segre Place 127 Segre Place
 End of changes. 41 change blocks. 
84 lines changed or deleted 420 lines changed or added