[Docs] [txt|pdf] [Tracker] [WG] [Email] [Diff1] [Diff2] [Nits]

Versions: 00 01 02 03 04 05 RFC 2640

  FTPEXT Working Group                                         B. Curtin
  INTERNET DRAFT                      Defense Information Systems Agency
  Expires 10 September 1997                                10 March 1997


            Internationalization of the File Transfer Protocol
                    <draft-ietf-ftpext-intl-ftp-01.txt>


  Status of this Memo

     This document is an Internet-Draft.  Internet-Drafts are working
     documents of the Internet Engineering Task Force (IETF), its
     areas, and its working groups. Note that other groups may also
     distribute working documents as Internet-Drafts.

     Internet-Drafts are draft documents valid for a maximum of six
     months. Internet-Drafts may be updated, replaced, or obsoleted by
     other documents at any time.  It is not appropriate to use
     Internet-Drafts as reference material or to cite them other than
     as a "working draft" or "work in progress".

     To learn the current status of any Internet-Draft, please check
     the 1id-abstracts.txt listing contained in the Internet-Drafts
     Shadow Directories on ds.internic.net (US East Coast),
     nic.nordu.net (Europe), ftp.isi.edu (US West Coast), or
     munnari.oz.au (Pacific Rim).

     Distribution of this document is unlimited.  Please send comments
     to the FTP Extension working group (FTPEXT-WG) of the Internet
     Engineering Task Force (IETF) at <ftp-wg@hops.ag.utk.edu>.
     Subscription address is <ftp-wg-request@hops.ag.utk.edu>.
     Discussions of the group are archived at
     <URL:ftp://hops.ag.utk.edu/ftp-wg/archives/>.

     The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL
     NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED",  "MAY", and
     "OPTIONAL" in this document are to be interpreted as described in
     <draft-bradner-key-words-03.txt> [KEY WORDS].














                      Expires 10 Sept 1997                      [Page 1]=0C=


  INTERNET DRAFT    FTP Internationalization               10 March 1997

  Abstract

     The File Transfer Protocol, as defined in RFC 959 [RFC959] and RFC
     1123 Section 4 [RFC1123], is one of the oldest and widely used
     protocols on the Internet. The protocol's primary character set, 7
     bit ASCII, has served the protocol well through the early growth
     years of the Internet. However, as the Internet becomes more
     global, there is a need to support character sets beyond 7 bit
     ASCII.

     This document addresses the internationalization (I18n) of FTP,
     which includes supporting the multiple character sets found
     throughout the Internet community.  This is achieved  by extending
     the FTP specification and giving recommendations for proper
     internationalization support.






  Table of Contents

  1 INTRODUCTION......................................................3
  2 INTERNATIONALIZATION..............................................3
   2.1 International Character Set....................................4
   2.2 Transfer Encoding..............................................4
  3 CONFORMANCE.......................................................5
   3.1 General........................................................5
   3.2 International Servers..........................................6
   3.3 International Clients..........................................7
  4 SECURITY..........................................................7
  5 ACKNOWLEDGMENTS...................................................8
  6 GLOSSARY..........................................................8
  7 BIBLIOGRAPHY......................................................8
  8 AUTHOR'S ADDRESS.................................................10
  APPENDIX A - IMPLEMENTATION CONSIDERATIONS........................A-1
  APPENDIX B - SAMPLE CODE AND EXAMPLES.............................B-1
   B.1 Valid UTF-8 check............................................B-1
   B.2 Conversions..................................................B-1
     B.2.1 Conversion from local character set to UTF-8.............B-1
     B.2.2 Conversion from UTF-8 to local character set.............B-4
     B.2.3  ISO/IEC 8859-8 Example..................................B-5
     B.2.4 Vendor Codepage Example..................................B-6
   B.3 Pseudo Code for translating servers..........................B-7










                    Expires 10 September 1997                  [Page  2]=0C=


  INTERNET DRAFT    FTP Internationalization               10 March 1997

  1 Introduction

     As the Internet grows throughout the world the requirement to
     support character sets outside of the ASCII [ASCII] / Latin-1
     [ISO-8859] character set becomes ever more urgent.  For FTP,
     because of the large installed base, it is paramount that this be
     done without breaking existing clients and servers.  This document
     addresses this need. In doing so it defines a solution which will
     still allow the installed base to interoperate with new
     international clients and servers.

     This document enhances the capabilities of the File Transfer
     Protocol by removing the 7-bit restrictions on pathnames used in
     client commands and server responses, recommending the use of a
     Universal Character Set (UCS) ISO/IEC 10646 [ISO-10646], and
     recommending a UCS transformation format (UTF) UTF-8 [UTF-8].

     The recommendations made in this document are consistent with the
     recommendations expressed by the 29 Feb - 1 Mar 1996 IAB Character
     Set Workshop as expressed in <draft-weider-iab-char-wrkshop-
     00.txt> [WORKSHOP].

  2 Internationalization

     The File Transfer Protocol was developed in a period when the
     predominate character sets were 7 bit ASCII and 8 bit EBCDIC.
     Today these character sets cannot support the wide range of
     characters needed by multinational systems. Given that there are a
     number of character sets in current use that provide more
     characters than 7-bit ASCII, it makes sense to decide on a
     convenient way to represent the union of those possibilities. To
     work globally either requires support of a number of character
     sets and to be able to convert between them, or the use of a
     single preferred character set. To assure global interoperability
     this document RECOMMENDS the latter approach and defines a single
     character set, in addition to NVT ASCII and EBCDIC, which is
     understandable by all systems. For FTP this character set SHALL be
     ISO/IEC 10646:1993. For support of global compatibility it is
     STRONGLY RECOMMENDED that clients and servers use UTF-8 encoding
     when exchanging pathnames. Clients and servers are, however, under
     no obligation to perform any conversion on the contents of a file
     for operations such as STOR or RETR.

     The character set used to store files SHALL remain a local
     decision and MAY depend on the capability of local operating
     systems. Prior to the exchange of pathnames they should be
     converted into a ISO/IEC 10646 format and UTF-8 encoded. This
     approach, while allowing international exchange of pathnames, will
     still allow backward compatibility with older systems because the
     code set positions for ASCII characters are identical to the one
     byte sequence in UTF-8.

     Sections 2.1 and 2.2 give a brief description of the international
     character set and transfer encoding recommended by this document.

                    Expires 10 September 1997                  [Page  3]=0C=


  INTERNET DRAFT    FTP Internationalization               10 March 1997

     A more thorough description of UTF-8, ISO/IEC 10646, and UNICODE
     [UNICODE], beyond that given in this document, can be found in RFC
     2044 [RFC2044].

  2.1  International Character Set

     The character set defined for international support of FTP SHALL
     be the Universal Character Set as defined in ISO 10646:1993 as
     amended. This standard incorporates the character sets of many
     existing international, national, and corporate standards. ISO/IEC
     10646 defines two alternate forms of encoding, UCS-4 and UCS-2.
     UCS-4 is a four byte (31 bit) encoding containing 2**31 code
     positions divided into 128 groups of 256 planes. Each plane
     consists of 256 rows of 256 cells. UCS-2 is a 2 byte (16 bit)
     character set consisting of plane zero or the Basic Multilingual
     Plane (BMP).  Currently, no codesets have been defined outside of
     the 2 byte BMP.

     The Unicode standard version 2.0 [UNICODE] is consistent with the
     UCS-2 subset of ISO/IEC 10646. The Unicode standard version 2.0
     includes the repertoire of IS 10646 characters, amendments 1-7 of
     IS 10646, and editorial and technical corrigenda.


  2.2  Transfer Encoding

     UCS Transformation Format 8 (UTF-8), also known as UTF-2 or UTF-
     FSS, SHALL be used as a transfer encoding to transmit the
     international character set. UTF-8 is a file safe encoding which
     avoids the use of byte values which have special significance
     during the parsing of pathname character strings. UTF-8 is an 8
     bit encoding of the characters in the UCS. Some of UTF-8's
     benefits are that it is compatible with 7 bit ASCII, so it doesn't
     affect programs that give special meanings to various ASCII
     characters; it is immune to synchronization errors; its encoding
     rules allow for easy identification; and it has enough space to
     support large character sets.

     UTF-8 encoding represents each UCS character as a sequence of 1 to
     6 bytes in length. For all sequences of one byte the most
     significant bit is ZERO. For all sequences of more than one byte
     the number of ONE bits in the first byte, starting from the most
     significant bit position, indicates the number of bytes in the
     UTF-8 sequence followed by a ZERO bit. For example, the first byte
     of a 3 byte UTF-8 sequence would have 1110 as its most significant
     bits. Each additional bytes (continuing bytes) in the UTF-8
     sequence, contain a ONE bit followed by a ZERO bit as their most
     significant bits. The remaining free bit positions in the
     continuing bytes are used to identify characters in the UCS. The
     relationship between UCS and UTF-8 is demonstrated in the
     following table:




                    Expires 10 September 1997                  [Page  4]=0C=


  INTERNET DRAFT    FTP Internationalization               10 March 1997

     UCS-4 range(hex)          UTF-8 byte sequence(binary)
     00000000 - 0000007F       0xxxxxxx
     00000080 - 000007FF       110xxxxx 10xxxxxx
     00000800 - 0000FFFF       1110xxxx 10xxxxxx 10xxxxxx
     00010000 - 001FFFFF       11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
     00200000 - 03FFFFFF       111110xx 10xxxxxx 10xxxxxx 10xxxxxx
                               10xxxxxx
     04000000 - 7FFFFFFF       1111110x 10xxxxxx 10xxxxxx 10xxxxxx
                               10xxxxxx 10xxxxxx



     A beneficial property of UTF-8 is that its single byte sequence is
     consistent with the ASCII character set. This feature will allow a
     transition where old ASCII-only clients can still interoperate
     with new servers which support the UTF-8 encoding.

     Another feature is that the encoding rules make it very unlikely
     that a character sequence from a different character set will be
     mistaken for a UTF-8 encoded character sequence. Clients and
     servers can use a simple routine to determine if the character set
     being exchanged is valid UTF-8. Section B.1 shows a code example
     of this check.


  3 Conformance

  3.1  General

     - The 7-bit restriction for pathnames exchanged is dropped.

     - Many operating system allow the use of spaces <SP>, carriage
       return <CR>, and line feed <LF> characters as part of the
       pathname. The exchange of pathnames with these special command
       characters will cause the pathnames to be parsed improperly. This
       is because ftp commands associated with pathnames have the form:

                 COMMAND <SP> <pathname> <CRLF>.

       To allow the exchange of pathnames containing these characters,
       the definition of pathname is changed from
         <pathname> ::=3D <string>
       to
         <pathname> ::=3D <ext-string>
         <ext-string> ::=3D <ext-char> | <ext-char><ext-string>
         <ext-char> ::=3D any 8 bit sequence except 0x00.

       To avoid mistaking these characters within pathnames as special
       command characters the following rules will apply:

          There MUST be only one <SP> between a ftp command and the
          pathname. Implementations MUST assume <SP> characters
          following the initial <SP> as part of the pathname. For


                    Expires 10 September 1997                  [Page  5]=0C=


  INTERNET DRAFT    FTP Internationalization               10 March 1997

          example the pathname in STOR <SP><SP><SP>foo.bar<CRLF> is
          <SP><SP>foo.bar .

          Current implementations which may allow multiple <SP>
          characters as separators between the command and pathname MUST
          assure that they comply with this single <SP> convention.

      When a <CR> character is encountered as part of a pathname it
       MUST be padded with a <NUL> character prior to sending the
       command. On receipt of a pathname containing a <CR><NUL> sequence
       the <NUL> character MUST be stripped away. This approach is
       described in RFC 854 [RFC854] on pages 11 and 12. For example, to
       store a pathname foo<CR><LF>boo.bar the pathname would become
       foo<CR><NUL><LF>boo.bar prior to sending the command STOR
       <SP>foo<CR><NUL><LF>boo.bar<CRLF> .
       Upon receipt of the altered pathname the <NUL> character
       following the <CR> would be stripped away to form the original
       pathname.

     - Conforming internationalized clients and servers MUST support
       UTF-8 for the transfer and receipt of pathnames. Clients and
       servers MAY in addition give the user a choice to specify
       interpretation of pathnames in another encoding. Note that
       configuring clients and servers to use other character sets /
       encoding other than UTF-8 is outside of the scope of this
       document. While it is recognized that in certain operational
       scenarios this may be desirable, this is left as a quality of
       implementation and operational issue.

     - Pathnames are sequences of bytes.  The encoding of names that
       are valid UTF-8 sequences is assumed to be UTF-8.  The character
       set of other names is undefined. Clients and servers, unless
       otherwise configured to support a specific native character set,
       MUST check for a valid UTF-8 byte sequence to determine if the
       pathname being presented is UTF-8.

     - To avoid data loss, clients and servers SHOULD use the UTF-8
       encoded pathnames when unable to convert them to a usable code
       set.

     - There may be cases when the code set / encoding presented to the
       server or client cannot be determined. In such cases the raw
       bytes SHOULD be used.


  3.2  International Servers

     - Mirror servers may want to exactly reflect the site that they
       are mirroring. In such cases servers MAY store and present the
       exact pathname bytes that it received from the main server.





                    Expires 10 September 1997                  [Page  6]=0C=


  INTERNET DRAFT    FTP Internationalization               10 March 1997

  3.3  International Clients

     - Clients which do not require display of pathnames are under no
       obligation to do so. Non-display clients do not need to conform
       to requirements associated with display.

     - Clients which are presented UTF-8 pathnames by the server SHOULD
       parse UTF-8 correctly, and attempt to display the pathname within
       the limitation of the resources available.

     - Character semantics of other names shall remain undefined. If a
       client detects that a server is non UTF-8, it SHOULD change its
       display appropriately. How a client implementation handles non
       UTF-8 is a quality of implementation issue. It MAY try to assume
       some other encoding, give the user a chance to try to assume
       something, or save encoding assumptions for a server from one FTP
       session to another.

     - Glyph rendering is outside the scope of this document. How a
       client presents characters it cannot display is a quality of
       implementation issue. This document RECOMMENDS that octets
       corresponding to non-displayable characters SHOULD be presented
       in URL %HH format defined in RFC 1738 [RFC1738]. They MAY,
       however, display them as question marks, with their UCS
       hexadecimal value, or in any other suitable fashion.

     - Many existing clients interpret 8-bit pathnames as being in the
       local character set. They MAY continue to do so for pathnames
       that are not valid UTF-8.


  4 Security

     This document addresses the support of character sets beyond 1
     byte. Conformance to this document should not induce a security
     threat.



















                    Expires 10 September 1997                  [Page  7]=0C=


  INTERNET DRAFT    FTP Internationalization               10 March 1997

  5 Acknowledgments

     The following people have contributed to this document:

     Alex Belits
     D. J. Berstein
     Martin J. Duerst
     Mark Harris
     Paul Hethmon
     Alun Jones
     James Matthews
     Keith Moore
     Sandra O'Donnell
     Benjamin Riefenstahl
     Stephen Tihor

     (and others from the FTPEXT working group)

  6 Glossary

     BIDI - abbreviation for Bi-directional, a reference to mixed right-
     to-left or left-to-right text.

     Character Set / Charset -  a collection of characters used to
     represent textual information in which each character has a numeric
     value

     Code Set -  (see character set).

     Glyph - a character image represented on a display device.

     I18N - "I eighteen N", the first and last letters of the word
     "internationalization" and the eighteen letters in between.

     UCS-2 - the ISO/IEC 10646 two octet Universal Character Set form.

     UCS-4 - the ISO/IEC 10646 four octet Universal Character Set form.

     UTF-8 - the UCS Transformation Format represented in 8 bits.

     UTF-16 - the USC-2 encoding equivalent with UNICODE.



  7 Bibliography

     [ASCII]

       ANSI X3.4:1986 Coded Character Sets - 7 Bit American National
       Standard Code for Information Interchange (7-bit ASCII)





                    Expires 10 September 1997                  [Page  8]=0C=


  INTERNET DRAFT    FTP Internationalization               10 March 1997


     [ISO-8859]

       ISO 8859.  International standard -- Information processing --
       8-bit single-byte coded graphic character sets -- Part 1: Latin
       alphabet No. 1 (1987) -- Part 2: Latin alphabet No. 2 (1987) --
       Part 3: Latin alphabet No. 3 (1988) -- Part 4: Latin alphabet No.
       4 (1988) -- Part 5: Latin/Cyrillic alphabet (1988) -- Part 6:
       Latin/Arabic alphabet (1987) -- Part : Latin/Greek alphabet
       (1987) -- Part 8: Latin/Hebrew alphabet (1988) -- Part 9: Latin
       alphabet No. 5 (1989) -- Part10: Latin alphabet No. 6 (1992)

     [ISO-10646]

       ISO/IEC 10646-1:1993. International standard -- Information
       technology -- Universal multiple-octet coded character set (UCS)
       -- Part 1: Architecture and basic multilingual plane.

     [KEY WORDS]

       S. Bradner, " Key words for use in RFCs to Indicate Requirement
       Levels", Work in Progress <draft-bradner-key-words-03.txt>,
       January 1997.

     [RFC854]

       J. Postel, J Reynolds, "Telnet Protocol Specification", RFC 854,
       May 1983.


     [RFC959]

       J. Postel, J Reynolds, "File Transfer Protocol (FTP)", RFC 959,
       October 1985.

     [RFC1123]

       R. Braden, "Requirements for Internet Hosts -- Application and
       Support", RFC 1123, October 1989.

     [RFC1738]

       T. Berners-Lee, L. Masinter, M.McCahill, "Uniform Resource
       Locators (URL)", RFC 1738, December 1994.

     [RFC2044]

       F. Yergeau, "UTF-8, a transformation format of Unicode and ISO
       10646", RFC 2044, October 1996.






                    Expires 10 September 1997                  [Page  9]=0C=


  INTERNET DRAFT    FTP Internationalization               10 March 1997

     [UNICODE]

       The Unicode Consortium, "The Unicode Standard - Version 2.0",
       Addison Westley Developers Press, July 1996.

     [UTF-8]

       ISO/IEC 10646-1:1993 AMENDMENT 2 (1996). UCS Transformation
       Format 8 (UTF-8).

     [WORKSHOP]

       C. Weider, " The Report of the IAB Character Set Workshop held 29
       February - 1 March, 1996", Work in Progress <draft-weider-iab-
       char-wrkshop-00.txt>, November 1996.





  8 Author's Address

     JIEO
     Attn JEBBD (Bill Curtin)
     Ft. Monmouth, N.J.
          07703-5613
     curtinw@ftm.disa.mil




























                    Expires 10 September 1997                 [Page  10]=0C=


  INTERNET DRAFT    FTP Internationalization               10 March 1997


                Appendix A - Implementation Considerations




     - Implementers should be aware that ISO 10646 is amended from time
       to time; 4 amendments have been adopted since the initial 1993
       publication, none of which significantly affects this
       specification.  A fifth amendment, now under consideration, will
       introduce incompatible changes to the standard: 6556 Korean
       Hangul syllables allocated between code positions 3400 and 4DFF
       (hexadecimal) will be moved to new positions (and 4516 new
       syllables added), thus making references to the old positions
       invalid.  Since the Unicode consortium has already adopted the
       corresponding amendment in Unicode 2.0, adoption of DAM 5 is
       considered likely and implementers should probably consider the
       old code positions as already invalid.  Despite this one-time
       change, the relevant standard bodies have committed themselves
       not to change any allocated code position in the future.  To
       encode Korean Hangul irrespective of these changes, the
       conjoining Hangul Jamo in the range 1110-11F9 can be used.

     - Implementers should ensure that their code accounts for
       potential problems, such as using a NULL character to terminate a
       string or no longer being able to steal the high order bit for
       internal use, when supporting the extended character set.

     - Implementers should be aware that there is a chance that
       pathnames which are non UTF-8 may be parsed as valid UTF-8. The
       probability is low for some encoding or statistically zero to
       zero for others. A recent non-scientific analysis found that EUC
       encoded words had a 2.7% false reading; SJIS had a 0.0005% false
       reading; other encoding such as ASCII or KOI-8 have a 0% false
       reading. This probability is highest for short pathnames and
       decreases as pathname size increases. Implementers may want to
       look for signs that pathnames which parsed as UTF-8 are not valid
       UTF-8, such as the existence of multiple local character sets in
       short pathnames. Hopefully, as more implementations conform to
       UTF-8 transfer encoding there will be a smaller need to guess at
       the encoding.

     - Client developers should be aware that it will be possible for
       pathnames to contain mixed characters (e.g.
       /Latin1DirectoryName/HebrewFileName). They should be prepared to
       handle the Bi-directional (BIDI) display of these character sets
       (i.e. right to left display for the directory and left to right
       display for the filename). While bi-directional display is
       outside the scope of this document and more complicated than the
       above example, an algorithm for bi-directional display can be
       found in the UNICODE 2.0 [UNICODE] standard. Also note that
       pathnames can have different byte ordering yet be logically and
       display-wise equivalent due to the insertion of BIDI control
       characters at different points during composition. Also note that


                    Expires 10 September 1997                [Page  A-1]=0C=


  INTERNET DRAFT    FTP Internationalization               10 March 1997

       mixed character sets may also present problems with font
       swapping.


     - A server that copies pathnames transparently from a local
       filesystem may continue to do so. It is then up to the local file
       creators to use UTF-8 pathnames.

     - Servers can supports charset labeling of files and/or
       directories, such that different file names may have different
       charsets. The server should attempt to convert all file names to
       UTF-8, but if it can't then it should leave that name in its raw
       form.

      - Server's OS that do not mandate the character set, but the
       administrator can configures it in the FTP server. The server
       should be configured to use a particular mapping table.  (Maybe
       external, but the server might have some common choices
       built-in.)  This also allows the flexibility of defining
       different charsets for different directories.

     - If the server's OS does not mandate the character set and it is
       not configured. The server should simply use the raw bytes in the
       file name.  They might be ASCII or UTF-8.

     - Server is a mirror, and wants to look just like the site it is
       mirroring. It should save the exact file name bytes that it
       received from the main server.



























                    Expires 10 September 1997                [Page  A-2]=0C=


  INTERNET DRAFT    FTP Internationalization               10 March 1997


                   Appendix B - Sample Code and Examples


  B.1  Valid UTF-8 check

     The following routine checks if a byte sequence is valid UTF-8.

     int utf8_valid(const unsigned char *buf, unsigned int len)
     {
       const unsigned char *endbuf =3D buf + len;
       int trailing =3D 0;    // trailing (continuation) bytes to follow

       while (buf !=3D endbuf)
       {
          unsigned char c =3D *buf++;
          if (trailing)
            if      ((c&0xC0) =3D=3D 0x80)  trailing--;
            else                        return 0;
          else
            if      ((c&0x80) =3D=3D 0x00)  continue;
            else if ((c&0xE0) =3D=3D 0xC0)  trailing =3D 1;
            else if ((c&0xF0) =3D=3D 0xE0)  trailing =3D 2;
            else if ((c&0xF8) =3D=3D 0xF0)  trailing =3D 3;
            else if ((c&0xFC) =3D=3D 0xF8)  trailing =3D 4;
            else if ((c&0xFE) =3D=3D 0xFC)  trailing =3D 5;
            else                        return 0;
       }
       return trailing =3D=3D 0;
     }

  B.2  Conversions

     The code examples in this section closely reflect the algorithm in
     ISO 10646 and may not present the most efficient solution for
     converting to / from UTF-8 encoding. If efficiency is an issue,
     implementers should use the appropriate bitwise operators.

     Additional code examples and numerous mapping tables can be found
     at the Unicode site, HTTP://www.unicode.org or FTP://unicode.org.

     Note that the conversion examples below assume that the local
     character set supported in the operating system is something other
     than UCS2/UTF-16. There are some operating systems which already
     support UCS2/UTF-16 (notably Plan 9 and Windows NT). In this case
     no conversion will be necessary from the local character set to
     the UCS.

  B.2.1  Conversion from local character set to UTF-8

     Conversion from the local filesystem character set to UTF-8 will
     normally involve a two step process. First convert the local
     character set to the UCS; then convert the UCS to UTF-8.



                    Expires 10 September 1997                [Page  B-1]=0C=


  INTERNET DRAFT    FTP Internationalization               10 March 1997

     The first step in the process can be performed by maintaining a
     mapping table which includes the local character set code and the
     corresponding UCS code. For instance the ISO/IEC 8859-8 [ISO-8859]
     code for the Hebrew letter "VAV" is 0xE4. The corresponding 4 byte
     ISO/IEC 10646 code is 0x000005D5.

     The next step is to convert the UCS character code to the UTF-8
     encoding. The following routine can be used to determine and
     encode the correct number of bytes based on the UCS-4 character
     code:


  unsigned int ucs4_to_utf8 (unsigned long *ucs4_buf, unsigned int
                             ucs4_len, unsigned char *utf8_buf)

  {
   const unsigned long *ucs4_endbuf =3D ucs4_buf + ucs4_len;
   unsigned int utf8_len =3D 0;            // return value for UTF8 size
   unsigned char *t_utf8_buf =3D utf8_buf; // Temporary pointer
                                          // to load UTF8 values

   while (ucs4_buf !=3D ucs4_endbuf)
   {
     if ( *ucs4_buf <=3D 0x7F)    // ASCII chars no conversion needed
     {
      *t_utf8_buf++ =3D (unsigned char) *ucs4_buf;
      utf8_len++;
      ucs4_buf++;
     }
     else
      if ( *ucs4_buf <=3D 0x07FF )  // In the 2 byte utf-8 range
      {
       *t_utf8_buf++=3D (unsigned char) (0xC0 + (*ucs4_buf/0x40));
       *t_utf8_buf++=3D (unsigned char) (0x80 + (*ucs4_buf%0x40));
       utf8_len+=3D2;
       ucs4_buf++;
      }
      else
       if ( *ucs4_buf <=3D 0xFFFF ) /* In the 3 byte utf-8 range. The
                                     values 0x0000FFFE, 0x0000FFFF
                                     and 0x0000D800 - 0x0000DFFF do
                                     not occur in UCS-4 */
       {
        *t_utf8_buf++=3D (unsigned char) (0xE0 +  (*ucs4_buf/0x1000));
        *t_utf8_buf++=3D (unsigned char) (0x80 +
                       ((*ucs4_buf/0x40)%0x40));
        *t_utf8_buf++=3D (unsigned char) (0x80 + (*ucs4_buf%0x40));
        utf8_len+=3D3;
        ucs4_buf++;

       }
       else
        if ( *ucs4_buf <=3D 0x1FFFFF ) //In the 4 byte utf-8 range
        {

                    Expires 10 September 1997                [Page  B-2]=0C=


  INTERNET DRAFT    FTP Internationalization               10 March 1997

          *t_utf8_buf++=3D (unsigned char) (0xF0 +  (*ucs4_buf/0x040000))=
;
          *t_utf8_buf++=3D (unsigned char) (0x80 +
                         ((*ucs4_buf/0x10000)%0x40));
          *t_utf8_buf++=3D (unsigned char) (0x80 +
                         ((*ucs4_buf/0x40)%0x40));
          *t_utf8_buf++=3D (unsigned char) (0x80 + (*ucs4_buf%0x40));
          utf8_len+=3D4;
          ucs4_buf++;

        }
        else
          if ( *ucs4_buf <=3D 0x03FFFFFF )//In the 5 byte utf-8 range
          {
           *t_utf8_buf++=3D (unsigned char) (0xF8 +
                          (*ucs4_buf/0x01000000));
           *t_utf8_buf++=3D (unsigned char) (0x80 +
                          ((*ucs4_buf/0x040000)%0x40));
           *t_utf8_buf++=3D (unsigned char) (0x80 +
                          ((*ucs4_buf/0x1000)%0x40));
           *t_utf8_buf++=3D (unsigned char) (0x80 +
                          ((*ucs4_buf/0x40)%0x40));
           *t_utf8_buf++=3D (unsigned char) (0x80 +
                          (*ucs4_buf%0x40));
           utf8_len+=3D5;
           ucs4_buf++;
          }
          else
          if ( *ucs4_buf <=3D 0x7FFFFFFF )//In the 6 byte utf-8 range
           {
            *t_utf8_buf++=3D (unsigned char)
                           (0xF8 +(*ucs4_buf/0x40000000));
            *t_utf8_buf++=3D (unsigned char) (0x80 +
                           ((*ucs4_buf/0x01000000)%0x40));
            *t_utf8_buf++=3D (unsigned char) (0x80 +
                           ((*ucs4_buf/0x040000)%0x40));
            *t_utf8_buf++=3D (unsigned char) (0x80 +
                           ((*ucs4_buf/0x1000)%0x40));
            *t_utf8_buf++=3D (unsigned char) (0x80 +
                           ((*ucs4_buf/0x40)%0x40));
            *t_utf8_buf++=3D (unsigned char) (0x80 +
                           (*ucs4_buf%0x40));
            utf8_len+=3D6;
            ucs4_buf++;

           }
   }
   return (utf8_len);
  }







                    Expires 10 September 1997                [Page  B-3]=0C=


  INTERNET DRAFT    FTP Internationalization               10 March 1997

  B.2.2  Conversion from UTF-8 to local character set


     When moving from UTF-8 encoding to the local character set the
     reverse procedure is used. First the UTF-8 encoding is transformed
     into the UCS-4 character set. The UCS-4 is then converted to the
     local character set from a mapping table (i.e. the opposite of the
     table used to form the UCS-4 character code).

     To convert from UTF-8 to UCS-4 the free bits (those that do not
     define UTF-8 sequence size or signify continuation bytes) in a
     UTF-8 sequence are concatenated as a bit string. The bits are then
     distributed into a four byte sequence starting from the least
     significant bits. Those bits not assigned a bit in the four byte
     sequence are padded with ZERO bits. The following routine converts
     the UTF-8 encoding to UCS-4 character codes:


  int utf8_to_ucs4 (unsigned long *ucs4_buf, unsigned int utf8_len,
                 unsigned char *utf8_buf)
  {

  const unsigned char *utf8_endbuf =3D utf8_buf + utf8_len;
  unsigned int ucs_len=3D0;


   while (utf8_buf !=3D utf8_endbuf)
   {

     if ((*utf8_buf & 0x80) =3D=3D 0x00)  /*ASCII chars no conversion
                                       needed */
     {
      *ucs4_buf++ =3D (unsigned long) *utf8_buf;
      utf8_buf++;
      ucs_len++;
     }
     else
      if ((*utf8_buf & 0xE0)=3D=3D 0xC0) //In the 2 byte utf-8 range
      {
       *ucs4_buf++ =3D (unsigned long) (((*utf8_buf - 0xC0) * 0x40)
                      + ( *(utf8_buf+1) - 0x80));
       utf8_buf +=3D 2;
       ucs_len++;
      }
      else
       if ( (*utf8_buf & 0xF0) =3D=3D 0xE0 ) /*In the 3 byte utf-8
                                           range */
       {
       *ucs4_buf++ =3D (unsigned long) (((*utf8_buf - 0xE0) * 0x1000)
                       + (( *(utf8_buf+1) -  0x80) * 0x40)
                       + ( *(utf8_buf+2) - 0x80));
        utf8_buf+=3D3;
        ucs_len++;
       }

                    Expires 10 September 1997                [Page  B-4]=0C=


  INTERNET DRAFT    FTP Internationalization               10 March 1997

       else
        if ((*utf8_buf & 0xF8) =3D=3D 0xF0) /* In the 4 byte utf-8
                                           range */
        {
          *ucs4_buf++ =3D (unsigned long)
                            (((*utf8_buf - 0xF0) * 0x040000)
                          + (( *(utf8_buf+1) -  0x80) * 0x1000)
                          + (( *(utf8_buf+2) -  0x80) * 0x40)
                          + ( *(utf8_buf+3) - 0x80));
          utf8_buf+=3D4;
          ucs_len++;
        }
        else
          if ((*utf8_buf & 0xFC) =3D=3D 0xF8) /* In the 5 byte utf-8
                                             range */
          {
           *ucs4_buf++ =3D (unsigned long)
                            (((*utf8_buf - 0xF8) * 0x01000000)
                          + ((*(utf8_buf+1) - 0x80) * 0x040000)
                          + (( *(utf8_buf+2) -  0x80) * 0x1000)
                          + (( *(utf8_buf+3) -  0x80) * 0x40)
                          + ( *(utf8_buf+4) - 0x80));
           utf8_buf+=3D5;
           ucs_len++;
          }
          else
           if ((*utf8_buf & 0xFE) =3D=3D 0xFC) /* In the 6 byte utf-8
                                              range */
           {

            *ucs4_buf++ =3D (unsigned long)
                             (((*utf8_buf - 0xFC) * 0x40000000)
                           + ((*(utf8_buf+1) - 0x80) * 0x010000000)
                           + ((*(utf8_buf+2) - 0x80) * 0x040000)
                           + (( *(utf8_buf+3) -  0x80) * 0x1000)
                           + (( *(utf8_buf+4) -  0x80) * 0x40)
                           + ( *(utf8_buf+5) - 0x80));
            utf8_buf+=3D6;
            ucs_len++;
           }

   }
  return (ucs_len);
  }

  B.2.3     ISO/IEC 8859-8 Example

     This example demonstrates mapping ISO/IEC 8859-8 character set to
     UTF-8 and back to ISO/IEC 8859-8. As noted earlier, the Hebrew
     letter "VAV" is convertd from the ISO/IEC 8859-8  character code
     0xE4 to the corresponding 4 byte ISO/IEC 10646 code of 0x000005D5
     by a simple lookup of a conversion/mapping file.

     The UCS-4 character code is transformed into UTF-8 using the
     ucs4_to_utf8 routine described earlier by:

                    Expires 10 September 1997                [Page  B-5]=0C=


  INTERNET DRAFT    FTP Internationalization               10 March 1997


       1. Because the UCS-4 character is between 0x80 and 0x07FF it will
          map to a 2 byte UTF-8 sequence.
       2. The first byte is defined by (0xC0 + (0x000005D5 / 0x40)) =3D
          0xD7.
       3. The second byte is defined by (0x80 + (0x000005D5 % 0x40)) =3D
          0x95.

     The UTF-8 encoding is transferred back to UCS-4 by using the
     utf8_to_ucs4 routine described earlier by:

       1. Because the first byte of the sequence, when the '&' operator
          with a value of 0xE0 is applied, will produce 0xC0 (0xD7 &
          0xE0 =3D 0xC0) the UTF-8 is a 2 byte sequence.
       2.  The four byte UCS-4 character code is produced by (((0xD7 -
          0xC0) * 0x40) + (0x95 -0x80)) =3D 0x000005D5.

     Finally, the UCS-4 character code is converted to ISO/IEC 8859-8
     character code (using the mapping table which matches ISO/IEC
     8859-8 to UCS-4 ) to produce the original 0xE4 code for the Hebrew
     letter "VAV".

  B.2.4  Vendor Codepage Example

     This example demonstrates the mapping of a codepage to UTF-8  and
     back to a vendor codepage. Mapping between vendor codepages can be
     done in a very similar manner as described above. For instance
     both the PC and Mac codepages reflect the character set from the
     Thai standard TIS 620-2533. The character code on both platforms
     for the Thai letter "SO SO" is 0xAB. This character can then be
     mapped into the UCS-4 by way of a conversion/mapping file to
     produce the UCS-4 code of 0x0E0B.

     The UCS-4 character code is transformed into UTF-8 using the
     ucs4_to_utf8 routine described earlier by:

       1. Because the UCS-4 character is between 0x0800 and 0xFFFF it
          will map to a 3 byte UTF-8 sequence.
       2. The first byte is defined by (0xE0 + (0x00000E0B / 0x1000) =3D
          0xE0.
       3. The second byte is defined by (0x80 + ((0x00000E0B / 0x40) %
          0x40))) =3D 0xB8.
       4. The third byte is defined by (0x80 + (0x00000E0B % 0x40)) =3D
          0x8B.

     The UTF-8 encoding is transferred back to UCS-4 by using the
     utf8_to_ucs4 routine described earlier by:

       1. Because the first byte of the sequence, when the '&' operator
          with a value of 0xF0 is applied, will produce 0xE0 (0xE0 &
          0xF0 =3D 0xE0) the UTF-8 is a 3 byte sequence.
       2.  The four byte UCS-4 character code is produced by (((0xE0 -
          0xE0) * 0x1000) + ((0xB8 - 0x80) * 0x40) + (0x8B -0x80) =3D
          0x0000E0B.

                    Expires 10 September 1997                [Page  B-6]=0C=


  INTERNET DRAFT    FTP Internationalization               10 March 1997


     Finally, the UCS-4 character code is converted to either the PC or
     MAC codepage character code (using the mapping table which matches
     codepage to UCS-4 ) to produce the original 0xAB code for the Thai
     letter "SO SO".


  B.3  Pseudo Code for a high-quality translating server


  if utf8_valid(fn)
     {
     attempt to convert fn to the local charset, producing localfn
     if (conversion fails temporarily) return error
     if (conversion succeeds)
       {
       attempt to open localfn
       if (open fails temporarily) return error
       if (open succeeds) return success
       }
     }
  attempt to open fn
  if (open fails temporarily) return error
  if (open succeeds) return success
  return permanent error






























                    Expires 10 September 1997                [Page  B-7]


Html markup produced by rfcmarkup 1.109, available from https://tools.ietf.org/tools/rfcmarkup/