Internet Draft                                 Serge Winitzki
draft-winitzki-koi8c-encoding-00.txt
Expires: April 2002

                Extended Cyrillic Character Set KOI8-C

Status of this Memo

  This memo is an Internet-Draft and is subject to all provisions
  of Section 10 of RFC2026.

  Internet-Drafts are working documents of the Internet
  Engineering Task Force (IETF), its areas, and its working
  groups.  Note that other groups may also distribute working
  documents as Internet-Drafts.

  Internet-Drafts are draft documents valid for a maximum of six
  months and may be updated, replaced, or obsoleted by other
  documents at any time.  It is inappropriate to use
  Internet-Drafts as reference material or to cite them other than
  as "work in progress."

  The list of current Internet-Drafts can be accessed at
  http://www.ietf.org/ietf/1id-abstracts.txt.  The list of
  Internet-Draft Shadow Directories can be accessed at
  http://www.ietf.org/shadow.html.

Author

   Serge Winitzki <swinitzk@hotmail.com>

Abstract

  This document provides information about the character encoding
  KOI8-C (KOI8 Cyrillic), proposed for use with the Russian
  (including old orthography), Ukrainian, Belorussian, Serbian, and
  Macedonian languages together with special punctuation marks.
  KOI8-C is compatible with KOI8-R [1] and KOI8-U [2] in the area of
  Russian, Ukrainian, and Belorussian letters, and extends these
  with letters for old Russian orthography, Yugoslavian Cyrillic
  letters, and typographical symbols in positions compatible with
  CP1251 for use in legacy applications.

Proposed MIME character set name: koi8-c

Introduction

  This document provides information about a proposed new character
  encoding, KOI8-C, an extension of the KOI8-R and KOI8-U standards.
  This extension provides support for all Russian letters
  (including those needed for old Russian orthography), as well as
  Cyrillic letters used in the Belorussian, Macedonian, Serbian, and
  Ukrainian languages, and certain frequently used typographic
  characters borrowed from the CP1251 encoding. The KOI8-C encoding
  is compatible with the existing KOI8-RU and CP1251 encodings for
  the relevant characters.

Motivation

  The KOI8 family of encodings has long been used for electronic
  exchange of Cyrillic texts [1,2]. The following considerations
  have led the author to propose an extension to KOI8.

  1) A large area of the KOI8 encoding table (most of the 0x80-0xBF
  range) is, for historical reasons, occupied by pseudographic
  symbols that modern software does not use. Most KOI8 font
  implementations omit these symbols without any impact on user
  productivity. These positions in the encoding table could be
  used to represent more frequently used characters.

  2) The recent dominance of the "MS Windows" operating environment
  has resulted in wide adoption of word processors that use the
  "code page 1251" (CP1251) encoding for Cyrillic text. Many
  Internet documents are thus converted to KOI8 from CP1251 and
  frequently include typographical signs such as apostrophes,
  quotes, or dashes, which are not represented in the KOI8
  encodings and are left unchanged by automatic converters. These
  typographical symbols fall in the unused KOI8 pseudographics
  area.

  3) Texts in old Russian orthography (pre-1918) contain four
  Cyrillic letters not represented in any of the widely used
  Cyrillic encodings. Although Unicode-based tools would in
  principle be adequate for rendering these characters, most
  current software lacks the necessary support. It would be
  convenient to have an 8-bit encoding that represents the old
  Russian characters, so that they can be placed directly into a
  font encoding map and a keyboard layout compatible with a wide
  range of current software.
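
  Consideration 2 can be illustrated with a short Python sketch. It
  assumes the "cp1251" and "koi8_r" codecs of the Python standard
  library; the selection of signs is illustrative only and is not
  part of this proposal.

```python
# Typographical signs that CP1251 word processors commonly emit
# (an illustrative selection, not an exhaustive list).
SIGNS = "\u2018\u2019\u201c\u201d\u2013\u2014\u2026\u201e"

# Bytes that pass through unchanged when a converter does not know the signs.
raw = SIGNS.encode("cp1251")

# Every such byte lies in the 0x80-0xBF area of the KOI8 table.
in_pseudographics_area = all(0x80 <= b <= 0xBF for b in raw)

# Read as KOI8-R, the same bytes turn into unrelated symbols.
misread = raw.decode("koi8_r")

for sign, byte, wrong in zip(SIGNS, raw, misread):
    print(f"U+{ord(sign):04X} -> 0x{byte:02X} -> U+{ord(wrong):04X}")
```

  Each printed row shows a sign, the byte CP1251 assigns to it, and
  the character that a KOI8-R reader would display for that byte.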

Implementation

  The author has implemented the KOI8-C encoding according to these
  guidelines: (1) compatibility with the KOI8-R and KOI8-U character
  sets; (2) compatibility with the CP1251 character set in the area
  of typographical symbols and Yugoslavian Cyrillics; (3) the
  ability to convert fonts to other Cyrillic encodings.

  The lower part of the KOI8-C character set is a complete copy of
  ASCII in the range of printable characters (0x20 -- 0x7E). The
  range (0x01 -- 0x1F) is occupied by pseudographics and other
  rarely used special symbols, and position 0x7F holds NOT SIGN.

  The upper part of the KOI8-C character set contains all Russian,
  Belorussian, and Ukrainian letters at the positions defined in
  KOI8-R and KOI8-U; frequently used typographical symbols (quotes,
  dashes, and currency symbols) and Yugoslavian Cyrillics as
  defined by the CP1251 encoding; and old Russian letters. Most
  box-drawing characters from KOI8-R, as well as some mathematical
  symbols, were removed.
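
  The compatibility described above can be checked mechanically. The
  following Python sketch is only an illustration: it copies the code
  points for bytes 0x80 -- 0xBF from the table in "Specification of
  the KOI8-C codepage", takes bytes 0xC0 -- 0xFF from Python's
  "koi8_r" codec (those positions are identical in both tables), and
  compares the result with the "koi8_u" and "cp1251" codecs. The
  names UPPER, KOI8C and same_as are hypothetical.

```python
# Code points for KOI8-C bytes 0x80-0xBF, copied from the table in
# "Specification of the KOI8-C codepage" (16 entries per row).
UPPER = """
0402 0403 00B8 0453 201E 2026 2020 00A7 20AC 00A8 0409 2039 040A 040C 040B 040F
0452 2018 2019 201C 201D 2022 2013 2014 00A3 00B7 0459 203A 045A 045C 045B 045F
00A0 0475 0463 0451 0454 0455 0456 0457 0458 00AE 2122 00AB 0473 0491 045E 00B4
00B0 0474 0462 0401 0404 0405 0406 0407 0408 2116 00A2 00BB 0472 0490 040E 00A9
""".split()

KOI8C = {0x80 + i: chr(int(cp, 16)) for i, cp in enumerate(UPPER)}
# Bytes 0xC0-0xFF hold the Russian alphabet at the same positions as KOI8-R.
KOI8C.update({b: bytes([b]).decode("koi8_r") for b in range(0xC0, 0x100)})


def same_as(codec):
    """Return the bytes 0x80-0xFF that KOI8-C and `codec` decode identically."""
    return {b for b, ch in KOI8C.items()
            if bytes([b]).decode(codec, errors="replace") == ch}


koi8u_like = same_as("koi8_u")
cp1251_like = same_as("cp1251")

print("same as KOI8-U:", " ".join(f"{b:02X}" for b in sorted(koi8u_like)))
print("same as CP1251:", " ".join(f"{b:02X}" for b in sorted(cp1251_like)))
```

  The first line of output lists the letter positions shared with
  KOI8-U; the second lists the typographical symbols and Yugoslavian
  Cyrillics that sit at their CP1251 positions.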

  The resulting character set contains all ISO 8859-5 characters
  except for SOFT HYPHEN and covers CP1251 except for three
  characters that are also in CP1252 (SINGLE LOW-9 QUOTATION MARK,
  BROKEN BAR, and SOFT HYPHEN).
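
  A coverage claim of this kind can be re-derived from the table. The
  Python sketch below is an illustration under stated assumptions: the
  code points are copied from "Specification of the KOI8-C codepage"
  (REPLACEMENT CHARACTER entries omitted), the Russian alphabet is
  taken from Python's "koi8_r" codec, and the reference sets come from
  the "iso8859_5" and "cp1251" codecs. The names SPECIALS, REPERTOIRE
  and graphic_chars are hypothetical.

```python
import unicodedata

# KOI8-C code points outside ASCII and the Russian alphabet, copied from
# the specification table: bytes 0x01-0x1F, 0x7F, then 0x80-0xBF.
SPECIALS = """
25C6 2592 00D7 00F7 2030 2248 00B5 00B1 00B6 2021 2518 2510 250C 2514 253C
2500 251C 2524 2534 252C 2502 2264 2265 03C0 2260 00A4 00B2 00AC
0402 0403 00B8 0453 201E 2026 2020 00A7 20AC 00A8 0409 2039 040A 040C 040B 040F
0452 2018 2019 201C 201D 2022 2013 2014 00A3 00B7 0459 203A 045A 045C 045B 045F
00A0 0475 0463 0451 0454 0455 0456 0457 0458 00AE 2122 00AB 0473 0491 045E 00B4
00B0 0474 0462 0401 0404 0405 0406 0407 0408 2116 00A2 00BB 0472 0490 040E 00A9
""".split()

REPERTOIRE = {chr(int(cp, 16)) for cp in SPECIALS}
REPERTOIRE |= {chr(b) for b in range(0x20, 0x7F)}            # printable ASCII
REPERTOIRE |= set(bytes(range(0xC0, 0x100)).decode("koi8_r"))  # Russian letters


def graphic_chars(codec, first):
    """Characters that `codec` assigns to bytes first..0xFF, skipping gaps."""
    return set(bytes(range(first, 0x100)).decode(codec, errors="ignore"))


# ISO 8859-5 keeps control codes in 0x80-0x9F, so start at 0xA0 there.
missing_iso = graphic_chars("iso8859_5", 0xA0) - REPERTOIRE
missing_cp1251 = graphic_chars("cp1251", 0x80) - REPERTOIRE

for label, missing in (("ISO 8859-5", missing_iso), ("CP1251", missing_cp1251)):
    print(label, "lacks:", sorted(unicodedata.name(ch) for ch in missing))
```

  The script prints, for each reference encoding, the names of the
  characters that have no position in KOI8-C.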

  The Web page
  <http://www.geocities.com/CapeCanaveral/5735/1/koi8-extended.html>
  documents the author's development efforts related to the KOI8-C
  encoding and texts in old Russian orthography. The free bitmap
  fonts of the Cronyx family for the X Window System were adapted
  to the KOI8-C encoding, implementing a full KOI8-C map (256
  characters) in all fonts (the "xcyr" project). An extension of
  the keyboard layout containing the old Russian letters was
  proposed. A spellchecking dictionary for the old Russian
  orthography using the KOI8-C encoding was developed.

Relation to other efforts

  This encoding was designed as a modification of KOI8-R [1] and
  KOI8-U [2]. An independent font development project, "CYR-RFX",
  uses an alternative encoding, "KOI8-O", with similar objectives
  of compatibility with KOI8-R and CP1251, but it does not contain
  any Yugoslavian Cyrillic characters.

Specification of the KOI8-C codepage

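  The description of all characters of the KOI8-C codepage (0x01 --
  0xFF) is given according to the ISO 10646 Universal Character Set
  (UCS). In each row, <hex> is the KOI8-C byte value and <UCS> is
  the corresponding UCS code point in hexadecimal.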

  <hex> <UCS>   #       <description>

  0x01  U25C6   #       BLACK DIAMOND
  0x02  U2592   #       MEDIUM SHADE
  0x03  U00D7   #       MULTIPLICATION SIGN
  0x04  U00F7   #       DIVISION SIGN
  0x05  U2030   #       PER MILLE SIGN
  0x06  U2248   #       ALMOST EQUAL TO
  0x07  U00B5   #       MICRO SIGN
  0x08  U00B1   #       PLUS-MINUS SIGN
  0x09  U00B6   #       PILCROW SIGN
  0x0A  U2021   #       DOUBLE DAGGER
  0x0B  U2518   #       BOX DRAWINGS LIGHT UP AND LEFT
  0x0C  U2510   #       BOX DRAWINGS LIGHT DOWN AND LEFT
  0x0D  U250C   #       BOX DRAWINGS LIGHT DOWN AND RIGHT
  0x0E  U2514   #       BOX DRAWINGS LIGHT UP AND RIGHT
  0x0F  U253C   #       BOX DRAWINGS LIGHT VERTICAL AND HORIZONTAL
  0x10  UFFFD   #       REPLACEMENT CHARACTER
  0x11  UFFFD   #       REPLACEMENT CHARACTER
  0x12  U2500   #       BOX DRAWINGS LIGHT HORIZONTAL
  0x13  UFFFD   #       REPLACEMENT CHARACTER
  0x14  UFFFD   #       REPLACEMENT CHARACTER
  0x15  U251C   #       BOX DRAWINGS LIGHT VERTICAL AND RIGHT
  0x16  U2524   #       BOX DRAWINGS LIGHT VERTICAL AND LEFT
  0x17  U2534   #       BOX DRAWINGS LIGHT UP AND HORIZONTAL
  0x18  U252C   #       BOX DRAWINGS LIGHT DOWN AND HORIZONTAL
  0x19  U2502   #       BOX DRAWINGS LIGHT VERTICAL
  0x1A  U2264   #       LESS-THAN OR EQUAL TO
  0x1B  U2265   #       GREATER-THAN OR EQUAL TO
  0x1C  U03C0   #       GREEK SMALL LETTER PI
  0x1D  U2260   #       NOT EQUAL TO
  0x1E  U00A4   #       CURRENCY SIGN
  0x1F  U00B2   #       SUPERSCRIPT TWO
  0x20  U0020   #       SPACE
  0x21  U0021   #       EXCLAMATION MARK
  0x22  U0022   #       QUOTATION MARK
  0x23  U0023   #       NUMBER SIGN
  0x24  U0024   #       DOLLAR SIGN
  0x25  U0025   #       PERCENT SIGN
  0x26  U0026   #       AMPERSAND
  0x27  U0027   #       APOSTROPHE
  0x28  U0028   #       LEFT PARENTHESIS
  0x29  U0029   #       RIGHT PARENTHESIS
  0x2A  U002A   #       ASTERISK
  0x2B  U002B   #       PLUS SIGN
  0x2C  U002C   #       COMMA
  0x2D  U002D   #       HYPHEN-MINUS
  0x2E  U002E   #       FULL STOP
  0x2F  U002F   #       SOLIDUS
  0x30  U0030   #       DIGIT ZERO
  0x31  U0031   #       DIGIT ONE
  0x32  U0032   #       DIGIT TWO
  0x33  U0033   #       DIGIT THREE
  0x34  U0034   #       DIGIT FOUR
  0x35  U0035   #       DIGIT FIVE
  0x36  U0036   #       DIGIT SIX
  0x37  U0037   #       DIGIT SEVEN
  0x38  U0038   #       DIGIT EIGHT
  0x39  U0039   #       DIGIT NINE
  0x3A  U003A   #       COLON
  0x3B  U003B   #       SEMICOLON
  0x3C  U003C   #       LESS-THAN SIGN
  0x3D  U003D   #       EQUALS SIGN
  0x3E  U003E   #       GREATER-THAN SIGN
  0x3F  U003F   #       QUESTION MARK
  0x40  U0040   #       COMMERCIAL AT
  0x41  U0041   #       LATIN CAPITAL LETTER A
  0x42  U0042   #       LATIN CAPITAL LETTER B
  0x43  U0043   #       LATIN CAPITAL LETTER C
  0x44  U0044   #       LATIN CAPITAL LETTER D
  0x45  U0045   #       LATIN CAPITAL LETTER E
  0x46  U0046   #       LATIN CAPITAL LETTER F
  0x47  U0047   #       LATIN CAPITAL LETTER G
  0x48  U0048   #       LATIN CAPITAL LETTER H
  0x49  U0049   #       LATIN CAPITAL LETTER I
  0x4A  U004A   #       LATIN CAPITAL LETTER J
  0x4B  U004B   #       LATIN CAPITAL LETTER K
  0x4C  U004C   #       LATIN CAPITAL LETTER L
  0x4D  U004D   #       LATIN CAPITAL LETTER M
  0x4E  U004E   #       LATIN CAPITAL LETTER N
  0x4F  U004F   #       LATIN CAPITAL LETTER O
  0x50  U0050   #       LATIN CAPITAL LETTER P
  0x51  U0051   #       LATIN CAPITAL LETTER Q
  0x52  U0052   #       LATIN CAPITAL LETTER R
  0x53  U0053   #       LATIN CAPITAL LETTER S
  0x54  U0054   #       LATIN CAPITAL LETTER T
  0x55  U0055   #       LATIN CAPITAL LETTER U
  0x56  U0056   #       LATIN CAPITAL LETTER V
  0x57  U0057   #       LATIN CAPITAL LETTER W
  0x58  U0058   #       LATIN CAPITAL LETTER X
  0x59  U0059   #       LATIN CAPITAL LETTER Y
  0x5A  U005A   #       LATIN CAPITAL LETTER Z
  0x5B  U005B   #       LEFT SQUARE BRACKET
  0x5C  U005C   #       REVERSE SOLIDUS
  0x5D  U005D   #       RIGHT SQUARE BRACKET
  0x5E  U005E   #       CIRCUMFLEX ACCENT
  0x5F  U005F   #       LOW LINE
  0x60  U0060   #       GRAVE ACCENT
  0x61  U0061   #       LATIN SMALL LETTER A
  0x62  U0062   #       LATIN SMALL LETTER B
  0x63  U0063   #       LATIN SMALL LETTER C
  0x64  U0064   #       LATIN SMALL LETTER D
  0x65  U0065   #       LATIN SMALL LETTER E
  0x66  U0066   #       LATIN SMALL LETTER F
  0x67  U0067   #       LATIN SMALL LETTER G
  0x68  U0068   #       LATIN SMALL LETTER H
  0x69  U0069   #       LATIN SMALL LETTER I
  0x6A  U006A   #       LATIN SMALL LETTER J
  0x6B  U006B   #       LATIN SMALL LETTER K
  0x6C  U006C   #       LATIN SMALL LETTER L
  0x6D  U006D   #       LATIN SMALL LETTER M
  0x6E  U006E   #       LATIN SMALL LETTER N
  0x6F  U006F   #       LATIN SMALL LETTER O
  0x70  U0070   #       LATIN SMALL LETTER P
  0x71  U0071   #       LATIN SMALL LETTER Q
  0x72  U0072   #       LATIN SMALL LETTER R
  0x73  U0073   #       LATIN SMALL LETTER S
  0x74  U0074   #       LATIN SMALL LETTER T
  0x75  U0075   #       LATIN SMALL LETTER U
  0x76  U0076   #       LATIN SMALL LETTER V
  0x77  U0077   #       LATIN SMALL LETTER W
  0x78  U0078   #       LATIN SMALL LETTER X
  0x79  U0079   #       LATIN SMALL LETTER Y
  0x7A  U007A   #       LATIN SMALL LETTER Z
  0x7B  U007B   #       LEFT CURLY BRACKET
  0x7C  U007C   #       VERTICAL LINE
  0x7D  U007D   #       RIGHT CURLY BRACKET
  0x7E  U007E   #       TILDE
  0x7F  U00AC   #       NOT SIGN
  0x80  U0402   #       CYRILLIC CAPITAL LETTER DJE
  0x81  U0403   #       CYRILLIC CAPITAL LETTER GJE
  0x82  U00B8   #       CEDILLA
  0x83  U0453   #       CYRILLIC SMALL LETTER GJE
  0x84  U201E   #       DOUBLE LOW-9 QUOTATION MARK
  0x85  U2026   #       HORIZONTAL ELLIPSIS
  0x86  U2020   #       DAGGER
  0x87  U00A7   #       SECTION SIGN
  0x88  U20AC   #       EURO SIGN
  0x89  U00A8   #       DIAERESIS
  0x8A  U0409   #       CYRILLIC CAPITAL LETTER LJE
  0x8B  U2039   #       SINGLE LEFT-POINTING ANGLE QUOTATION MARK
  0x8C  U040A   #       CYRILLIC CAPITAL LETTER NJE
  0x8D  U040C   #       CYRILLIC CAPITAL LETTER KJE
  0x8E  U040B   #       CYRILLIC CAPITAL LETTER TSHE
  0x8F  U040F   #       CYRILLIC CAPITAL LETTER DZHE
  0x90  U0452   #       CYRILLIC SMALL LETTER DJE
  0x91  U2018   #       LEFT SINGLE QUOTATION MARK
  0x92  U2019   #       RIGHT SINGLE QUOTATION MARK
  0x93  U201C   #       LEFT DOUBLE QUOTATION MARK
  0x94  U201D   #       RIGHT DOUBLE QUOTATION MARK
  0x95  U2022   #       BULLET
  0x96  U2013   #       EN DASH
  0x97  U2014   #       EM DASH
  0x98  U00A3   #       POUND SIGN
  0x99  U00B7   #       MIDDLE DOT
  0x9A  U0459   #       CYRILLIC SMALL LETTER LJE
  0x9B  U203A   #       SINGLE RIGHT-POINTING ANGLE QUOTATION MARK
  0x9C  U045A   #       CYRILLIC SMALL LETTER NJE
  0x9D  U045C   #       CYRILLIC SMALL LETTER KJE
  0x9E  U045B   #       CYRILLIC SMALL LETTER TSHE
  0x9F  U045F   #       CYRILLIC SMALL LETTER DZHE
  0xA0  U00A0   #       NO-BREAK SPACE
  0xA1  U0475   #       CYRILLIC SMALL LETTER IZHITSA
  0xA2  U0463   #       CYRILLIC SMALL LETTER YAT
  0xA3  U0451   #       CYRILLIC SMALL LETTER IO
  0xA4  U0454   #       CYRILLIC SMALL LETTER UKRAINIAN IE
  0xA5  U0455   #       CYRILLIC SMALL LETTER DZE
  0xA6  U0456   #       CYRILLIC SMALL LETTER BYELORUSSIAN-UKRAINIAN I
  0xA7  U0457   #       CYRILLIC SMALL LETTER YI
  0xA8  U0458   #       CYRILLIC SMALL LETTER JE
  0xA9  U00AE   #       REGISTERED SIGN
  0xAA  U2122   #       TRADE MARK SIGN
  0xAB  U00AB   #       LEFT-POINTING DOUBLE ANGLE QUOTATION MARK
  0xAC  U0473   #       CYRILLIC SMALL LETTER FITA
  0xAD  U0491   #       CYRILLIC SMALL LETTER GHE WITH UPTURN
  0xAE  U045E   #       CYRILLIC SMALL LETTER SHORT U
  0xAF  U00B4   #       ACUTE ACCENT
  0xB0  U00B0   #       DEGREE SIGN
  0xB1  U0474   #       CYRILLIC CAPITAL LETTER IZHITSA
  0xB2  U0462   #       CYRILLIC CAPITAL LETTER YAT
  0xB3  U0401   #       CYRILLIC CAPITAL LETTER IO
  0xB4  U0404   #       CYRILLIC CAPITAL LETTER UKRAINIAN IE
  0xB5  U0405   #       CYRILLIC CAPITAL LETTER DZE
  0xB6  U0406   #       CYRILLIC CAPITAL LETTER BYELORUSSIAN-UKRAINIAN I
  0xB7  U0407   #       CYRILLIC CAPITAL LETTER YI
  0xB8  U0408   #       CYRILLIC CAPITAL LETTER JE
  0xB9  U2116   #       NUMERO SIGN
  0xBA  U00A2   #       CENT SIGN
  0xBB  U00BB   #       RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
  0xBC  U0472   #       CYRILLIC CAPITAL LETTER FITA
  0xBD  U0490   #       CYRILLIC CAPITAL LETTER GHE WITH UPTURN
  0xBE  U040E   #       CYRILLIC CAPITAL LETTER SHORT U
  0xBF  U00A9   #       COPYRIGHT SIGN
  0xC0  U044E   #       CYRILLIC SMALL LETTER YU
  0xC1  U0430   #       CYRILLIC SMALL LETTER A
  0xC2  U0431   #       CYRILLIC SMALL LETTER BE
  0xC3  U0446   #       CYRILLIC SMALL LETTER TSE
  0xC4  U0434   #       CYRILLIC SMALL LETTER DE
  0xC5  U0435   #       CYRILLIC SMALL LETTER IE
  0xC6  U0444   #       CYRILLIC SMALL LETTER EF
  0xC7  U0433   #       CYRILLIC SMALL LETTER GHE
  0xC8  U0445   #       CYRILLIC SMALL LETTER HA
  0xC9  U0438   #       CYRILLIC SMALL LETTER I
  0xCA  U0439   #       CYRILLIC SMALL LETTER SHORT I
  0xCB  U043A   #       CYRILLIC SMALL LETTER KA
  0xCC  U043B   #       CYRILLIC SMALL LETTER EL
  0xCD  U043C   #       CYRILLIC SMALL LETTER EM
  0xCE  U043D   #       CYRILLIC SMALL LETTER EN
  0xCF  U043E   #       CYRILLIC SMALL LETTER O
  0xD0  U043F   #       CYRILLIC SMALL LETTER PE
  0xD1  U044F   #       CYRILLIC SMALL LETTER YA
  0xD2  U0440   #       CYRILLIC SMALL LETTER ER
  0xD3  U0441   #       CYRILLIC SMALL LETTER ES
  0xD4  U0442   #       CYRILLIC SMALL LETTER TE
  0xD5  U0443   #       CYRILLIC SMALL LETTER U
  0xD6  U0436   #       CYRILLIC SMALL LETTER ZHE
  0xD7  U0432   #       CYRILLIC SMALL LETTER VE
  0xD8  U044C   #       CYRILLIC SMALL LETTER SOFT SIGN
  0xD9  U044B   #       CYRILLIC SMALL LETTER YERU
  0xDA  U0437   #       CYRILLIC SMALL LETTER ZE
  0xDB  U0448   #       CYRILLIC SMALL LETTER SHA
  0xDC  U044D   #       CYRILLIC SMALL LETTER E
  0xDD  U0449   #       CYRILLIC SMALL LETTER SHCHA
  0xDE  U0447   #       CYRILLIC SMALL LETTER CHE
  0xDF  U044A   #       CYRILLIC SMALL LETTER HARD SIGN
  0xE0  U042E   #       CYRILLIC CAPITAL LETTER YU
  0xE1  U0410   #       CYRILLIC CAPITAL LETTER A
  0xE2  U0411   #       CYRILLIC CAPITAL LETTER BE
  0xE3  U0426   #       CYRILLIC CAPITAL LETTER TSE
  0xE4  U0414   #       CYRILLIC CAPITAL LETTER DE
  0xE5  U0415   #       CYRILLIC CAPITAL LETTER IE
  0xE6  U0424   #       CYRILLIC CAPITAL LETTER EF
  0xE7  U0413   #       CYRILLIC CAPITAL LETTER GHE
  0xE8  U0425   #       CYRILLIC CAPITAL LETTER HA
  0xE9  U0418   #       CYRILLIC CAPITAL LETTER I
  0xEA  U0419   #       CYRILLIC CAPITAL LETTER SHORT I
  0xEB  U041A   #       CYRILLIC CAPITAL LETTER KA
  0xEC  U041B   #       CYRILLIC CAPITAL LETTER EL
  0xED  U041C   #       CYRILLIC CAPITAL LETTER EM
  0xEE  U041D   #       CYRILLIC CAPITAL LETTER EN
  0xEF  U041E   #       CYRILLIC CAPITAL LETTER O
  0xF0  U041F   #       CYRILLIC CAPITAL LETTER PE
  0xF1  U042F   #       CYRILLIC CAPITAL LETTER YA
  0xF2  U0420   #       CYRILLIC CAPITAL LETTER ER
  0xF3  U0421   #       CYRILLIC CAPITAL LETTER ES
  0xF4  U0422   #       CYRILLIC CAPITAL LETTER TE
  0xF5  U0423   #       CYRILLIC CAPITAL LETTER U
  0xF6  U0416   #       CYRILLIC CAPITAL LETTER ZHE
  0xF7  U0412   #       CYRILLIC CAPITAL LETTER VE
  0xF8  U042C   #       CYRILLIC CAPITAL LETTER SOFT SIGN
  0xF9  U042B   #       CYRILLIC CAPITAL LETTER YERU
  0xFA  U0417   #       CYRILLIC CAPITAL LETTER ZE
  0xFB  U0428   #       CYRILLIC CAPITAL LETTER SHA
  0xFC  U042D   #       CYRILLIC CAPITAL LETTER E
  0xFD  U0429   #       CYRILLIC CAPITAL LETTER SHCHA
  0xFE  U0427   #       CYRILLIC CAPITAL LETTER CHE
  0xFF  U042A   #       CYRILLIC CAPITAL LETTER HARD SIGN
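
  Rows in the format above are straightforward to turn into a
  conversion table. The following Python sketch is one possible
  approach, not a reference implementation: the names ROW,
  parse_table, decode and encode are hypothetical, and only five rows
  copied from the table are used as sample input.

```python
import re

# Matches rows such as "0xA2  U0463   #       CYRILLIC SMALL LETTER YAT".
ROW = re.compile(r"^\s*0x([0-9A-Fa-f]{2})\s+U([0-9A-Fa-f]{4})\s+#\s*(.*)$")


def parse_table(text):
    """Map each byte value to its character for every row found in `text`."""
    table = {}
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            byte, code_point = (int(g, 16) for g in match.group(1, 2))
            table[byte] = chr(code_point)
    return table


def decode(data, table):
    """Decode bytes with `table`; bytes without a row become U+FFFD."""
    return "".join(table.get(b, "\ufffd") for b in data)


def encode(text, table):
    """Encode `text` with the inverse of `table` (KeyError if unmappable)."""
    inverse = {ch: b for b, ch in table.items() if ch != "\ufffd"}
    return bytes(inverse[ch] for ch in text)


# Five rows copied from the table, enough for one word in old orthography.
SAMPLE_ROWS = """
  0xA2  U0463   #       CYRILLIC SMALL LETTER YAT
  0xC2  U0431   #       CYRILLIC SMALL LETTER BE
  0xC8  U0445   #       CYRILLIC SMALL LETTER HA
  0xCC  U043B   #       CYRILLIC SMALL LETTER EL
  0xDF  U044A   #       CYRILLIC SMALL LETTER HARD SIGN
"""

table = parse_table(SAMPLE_ROWS)
word = decode(b"\xc8\xcc\xa2\xc2\xdf", table)  # HA, EL, YAT, BE, HARD SIGN
print(word, encode(word, table).hex())
```

  Feeding the complete table to parse_table in place of SAMPLE_ROWS
  would give a mapping for every listed byte.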

Security Considerations

  This memo raises no known security issues.

Acknowledgments

  The author is grateful to Markus Kuhn (Computer Science
  Laboratory, University of Cambridge, UK) for help on creating the
  KOI8-C encoding table.

References

  [1]  Chernov, A., "Registration of a Cyrillic Character Set",
       RFC 1489, July 1993.

  [2]  KOI8-U Working Group, "Ukrainian Character Set KOI8-U",
       RFC 2319, April 1998.

Author's Address

   Serge Winitzki
   4 Arizona Ter. #2
   Arlington, MA 02474
   USA

