draft-ietf-idn-cjk-00.txt   draft-ietf-idn-cjk-01.txt 
Internet Draft James SENG Internet Draft James SENG
<draft-ietf-idn-cjk-00.txt> Yoshiro YONEYA <draft-ietf-idn-cjk-01.txt> Yoshiro YONEYA
12th Sep 2000 Kenny HUANG 11th Apr 2001 Kenny HUANG
Expires 12 Mar 2001 KIM Kyongsok Expires 11 Oct 2001 KIM Kyongsok
Han Ideograph (CJK) for Internationalized Domain Names
Status of this Memo
This document is an Internet-Draft and is in full conformance
with all provisions of Section 10 of RFC2026.
Internet-Drafts are working documents of the Internet
Engineering Task Force (IETF), its areas, and its working
skipping to change at line 57 skipping to change at line 57
1. Definition and convention
Characters mentioned in this document are identified by their position
or code point in the Unicode character set [UCS]. The notation U+12AB,
for example, indicates the character at the position 12AB (hexadecimal)
in the [UCS]. It is strongly recommended that a [UCS] table be available
for reference for the ideographs described.
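As an illustration of the U+XXXX notation, the following minimal sketch
converts between the notation and actual characters. The function names
are hypothetical and are not part of this draft.

```python
import re

# Matches tokens such as U+12AB (four to six hexadecimal digits).
_UPLUS = re.compile(r"U\+([0-9A-Fa-f]{4,6})")


def parse_code_points(text):
    """Return the characters named by the U+XXXX tokens in `text`, in order."""
    return "".join(chr(int(digits, 16)) for digits in _UPLUS.findall(text))


def to_notation(s):
    """Render each character of `s` as a U+XXXX token, separated by spaces."""
    return " ".join(f"U+{ord(c):04X}" for c in s)
```

For example, parse_code_points("{U+6F22 U+5B57}") yields the two
characters of 'kanji', and to_notation() reverses the conversion.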
Han ideographs are defined as the Chinese ideographs from U+3400 to
U+9FFF, commonly known as CJK Unified Ideographs. This covers Chinese
'hanzi' {U+6F22 U+5B57/U+6C49 U+5B57}, Japanese 'kanji'
{U+6F22 U+5B57} and Korean 'hanja' {U+6F22 U+5B57/U+D55C U+C790}.
Additional Han ideographs will appear in other locations (not
necessarily in plane 0) in the future.
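The range given above can be checked mechanically. The sketch below
assumes exactly the definition in this section (U+3400 to U+9FFF) and
does not account for ideographs encoded elsewhere; the names are
hypothetical.

```python
# Range taken from the definition in this section (an assumption of the
# sketch: ideographs outside this range are not recognised).
HAN_FIRST = 0x3400
HAN_LAST = 0x9FFF


def is_han_ideograph(ch):
    """True if the single character `ch` lies in U+3400..U+9FFF."""
    return HAN_FIRST <= ord(ch) <= HAN_LAST


def all_han(label):
    """True if `label` is non-empty and consists only of Han ideographs."""
    return len(label) > 0 and all(is_han_ideograph(c) for c in label)
```

With this definition, the hanzi spelling {U+6F22 U+5B57} passes the
check, while the Hangul spelling {U+D55C U+C790} does not.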
Conversion between ideographs can be done using four different
approaches: code-based substitution, character-based substitution,
lexicon-based substitution and context-based substitution. Han folding
refers only to code-based substitution, similar to case mapping of
alphabetic characters.
skipping to change at line 114 skipping to change at line 111
and evolve, the form of the ideographs sometimes differs slightly from
country to country. For example, the word 'villa' is 'zhuang' {U+838A}
in Chinese but 'sou' {U+8358} in Japanese. These are given different
code points in Unicode.
3. Chinese (Hanzi)
Chinese ideographs or hanzi {U+6F22 U+5B57/U+6C49 U+5B57} originated
from pictographs. They are 'pictures' which evolved into ideographs
over several thousand years. For instance, the ideograph for "hill"
{U+5C71} still bears some resemblance to the three peaks of a hill.
Not all ideographs are pictographs. There are other classifications such
as compound ideographs, phonetic ideographs etc. For example,
'endurance' {U+5FCD} is a pierced 'knife' {U+5200} above the 'heart'
{U+5FC3}, or as a Chinese saying goes, 'endurance is like having a
pierced knife in your heart'.
Hence, almost all Han ideographs carry some meaning by themselves,
which is very different from most other scripts. This causes some
skipping to change at line 171 skipping to change at line 165
In domain names, we are particularly interested in equivalence
comparison of names, not in SC-to-TC conversion. Therefore, for this
purpose, equivalence matching can be done by applying TC-to-SC folding
prior to comparison, similar to lower-casing English strings before
comparing them, e.g. 'taiwan' SC {U+53F0 U+6E7E} will match
TC {U+81FA U+7063} or TC {U+53F0 U+7063}.
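The folding-before-comparison idea can be sketched as follows. The
two-entry table is illustrative only and is an assumption of this
sketch; it uses U+7063, the traditional form whose simplified
counterpart is U+6E7E. A real table would be derived from a source such
as [UNIHAN]. The function names are hypothetical.

```python
# Illustrative TC-to-SC table (assumption: a real deployment would use a
# complete, agreed-upon table rather than these two entries).
TC_TO_SC = {
    "\u81fa": "\u53f0",  # traditional 'tai' -> simplified 'tai'
    "\u7063": "\u6e7e",  # traditional 'wan' -> simplified 'wan'
}


def han_fold(label):
    """Replace each traditional ideograph with its simplified form."""
    return "".join(TC_TO_SC.get(c, c) for c in label)


def labels_equivalent(a, b):
    """Compare two labels after folding, like a case-insensitive compare."""
    return han_fold(a) == han_fold(b)
```

Folding is applied to both operands, so the stored name does not need to
be converted; only the comparison key changes.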
The side effect of this method is that comparing SC {U+53D1} to TC
{U+767C} or TC {U+9AEE} will both be positive. This implies that SC
'hair' {U+5934 U+53D1} will match TC {U+982D U+9AEE}. It will also
match TC {U+982D U+767C}, which does not have any meaning in Chinese.
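The many-to-one side effect can be made visible by inverting the folding
table. The sketch below is an illustration under assumptions: the table
holds only the three entries needed for the 'hair' example, with U+767C
and U+9AEE as the two traditional forms that fold to U+53D1, and the
function name is hypothetical.

```python
from itertools import product

# Illustrative table: two traditional ideographs fold to the same
# simplified ideograph U+53D1.
TC_TO_SC = {
    "\u767c": "\u53d1",  # traditional 'fa' (to issue)
    "\u9aee": "\u53d1",  # traditional 'fa' (hair)
    "\u982d": "\u5934",  # traditional 'tou' (head)
}


def tc_preimages(sc_label):
    """List the spellings built from traditional forms that fold to
    `sc_label`. A character with no traditional form in the table is
    kept as-is."""
    inverse = {}
    for tc, sc in TC_TO_SC.items():
        inverse.setdefault(sc, []).append(tc)
    choices = [sorted(inverse.get(c, [c])) for c in sc_label]
    return ["".join(combo) for combo in product(*choices)]
```

For the SC label {U+5934 U+53D1} this returns two spellings, only one of
which is a real word; a folding comparison cannot tell them apart.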
It should also be noted that SC are not used together with TC. Hence,
'hair' is written either as SC {U+5934 U+53D1} or TC {U+982D U+9AEE}
but (almost) never as {U+5934 U+9AEE} or {U+982D U+53D1}. So the
problem of SC and TC may not be too serious for IDN.
Unfortunately, when it comes to names in Chinese, places where SC are
used (i.e. Singapore and China), traditional and simplified ideographs
skipping to change at line 228 skipping to change at line 219
Koreans also invented their own ideographs, which are called 'gugja'
{U+56FD U+5B57/U+AD6D U+C790}.
5. Japanese (Kanji, Hiragana, Katakana)
The Japanese have adopted Chinese ideographs from the Koreans and the
Chinese since the 5th century. Chinese ideographs in Japanese are known
as 'kanji' {U+6F22 U+5B57}. They also developed their own syllabaries,
hiragana {U+5E73 U+4EEE U+540D} (U+3040-U+309F) and katakana
{U+7247 U+4EEE U+540D} (U+30A0-U+30FF), both of which are derived from
kanji that have the same pronunciation. Hiragana is a simplified
cursive form; for example, 'a' {U+3042} was derived from 'an' {U+5B89}.
Katakana is a simplified partial form; for example, 'a' {U+30A2} was
derived from 'a' {U+963F}. However, kanji remain very much integrated
within the Japanese language.
Japanese also invented ideographs known as 'kokuji' {U+56FD U+5B57}. For
example, 'iwashi' {U+9C2F} is a Japanese kokuji ideograph. Kokuji are
invented according to Han ligature rules. For example, 'touge' "mountain
pass" {U+5CE0} combines the meanings of 'yama' "mountain"
{U+5C71} + 'ue' "up" {U+4E0A} + 'shita' "down" {U+4E0B}.
skipping to change at line 285 skipping to change at line 273
For example, 'both' {U+5169} was simplified to {U+4E21}. Note that
Chinese simplified it to {U+4E24} instead. However, traditional Japanese
kanji are seldom used nowadays except in documenting old historical
texts, where they are treated differently from the more commonly used
simplified forms, or in proper nouns such as personal names or
trademarks. Hence, Han folding here is not recommended.
4. Vietnamese
While Vietnamese also adopted Chinese ideographs ('chu han') and created
their own ideographs ('chu nom'), these have now been replaced by
romanized 'quoc ngu'. Hence, this document does not attempt to address
any issues with 'chu han' or 'chu nom'.
5. zVariant
Unicode has a three-dimensional conceptual model for ideograph
unification. The three dimensions are semantic (X-axis - meaning,
function), abstract shape (Y-axis - general form) and actual shape
(Z-axis - instantiated, typeface).
skipping to change at line 340 skipping to change at line 325
recommended for domain name use.
It should be noted that the Unicode Consortium never intended the
ideographic description to be used in protocols like IDN where exact
comparison must be done. But it is certainly desirable to have this
feature, as it is common for Chinese to invent ideographs for names by
adding or removing radicals from standard ideographs.
7. Mechanism
The implicit proposal in this document is that CJKV ideographs may or
may not be "folded" for the purposes of comparison of domain names.
But if folding is required, there are four different ways that this
folding could be done.
a) Folding by DNS clients, or by user agents
b) Folding by DNS servers
c) Folding by Domain Name registration services for the purposes of
   preventing confusing allocations of CJKV Domain Names which would,
skipping to change at line 393 skipping to change at line 376
Shift-JIS characters only".
If conservative safety is really required, then
1) find the x-axis characters which are available in all major CJK
   character sets used on the Internet;
2) only allow variants of those in domain names;
3) when one variant is used, no other can be allocated. So comparisons
   are made on x-axis characters, but the licensee of that domain name
   can pick which y or z variants they wish to use.
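The three steps above can be sketched as a registration check. This is
a minimal illustration under stated assumptions, not a proposed
implementation: the class name, the variant table and the allowed set
are all hypothetical, and the example treats {U+5169} as the x-axis
character for the 'both' variants {U+4E21} and {U+4E24} mentioned
earlier in this document.

```python
class VariantRegistry:
    """Allocates a name only if no variant of it is already allocated."""

    def __init__(self, variant_to_x, allowed_x):
        self._fold = variant_to_x   # y/z variant -> x-axis character
        self._allowed = allowed_x   # step 1: permitted x-axis characters
        self._taken = {}            # x-axis key -> label as registered

    def key(self, label):
        """Map every character of `label` to its x-axis character."""
        return "".join(self._fold.get(c, c) for c in label)

    def register(self, label):
        """Return True if allocated, False if a variant is already taken.

        Raises ValueError if the label is empty or uses a character that
        is not a variant of an allowed x-axis character (step 2).
        """
        key = self.key(label)
        if not key or any(c not in self._allowed for c in key):
            raise ValueError("label uses characters outside the allowed set")
        if key in self._taken:      # step 3: one variant blocks the others
            return False
        self._taken[key] = label    # the registrant's chosen variant is kept
        return True
```

A registry built with the table {U+4E21: U+5169, U+4E24: U+5169} and the
allowed set {U+5169} would accept the first of the three variants to be
registered and refuse the other two.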
Acknowledgement
The editor gratefully acknowledges the contributions of:
Paul Hoffman <phoffman@imc.org>
Jiang Mingliang <jiang@i-DNS.net>
Dongman Lee <dlee@icu.ac.kr>
Karlsson Kent <keka@im.se>
Author(s)
James SENG
i-DNS.net International Pte Ltd.
8 Temasek Boulevard
Suntec Tower 3 #24-02
Singapore 038988
Email: James@Seng.cc
Tel: +65 2468208
Yoshiro YONEYA
NTT Software Corporation
Shinagawa Intercity Bldg., B-13F
2-15-2 Kohnan, Minato-ku Tokyo 108-6113 Japan
Email: yone@po.ntts.co.jp
Tel: +81-3-5782-7291
Kenny HUANG
Geotempo International Ltd; TWNIC
3F, No 16 Kang Hwa Street, Nei Hu
Taipei 114, Taiwan
Email: huangk@alum.sinica.edu
Tel: +886-2-2658-6510
KIM Kyongsok/GIM Gyeongseog
References
skipping to change at line 450 skipping to change at line 431
[CJKV] CJKV Information Processing ISBN 1-56592-224-7
[C2C] The Pitfalls and Complexities of Chinese to Chinese
Conversion. http://www.basistech.com/articles/C2C.html,
Jack Halpern, Jouni Kerman
[KANJIDIC] Sanseidos Unicode Kanji Information Dictionary
ISBN 4-385-13690-4
[UNICHART] Unicode chart http://charts.unicode.org/
[ZONGBIAO] Simplified Characters Standard Chart 2nd Edition, 1986
[UNIHAN] Unicode Han Database, Unicode Consortium
ftp://ftp.unicode.org/Public/UNIDATA/Unihan.txt
[ISO11941] ISO TS 11941: Information and documentation
Transliteration of Korean script into Latin characters.
Technical Specification 11941. First edition. 1996-12-31.
skipping to change at line 474 skipping to change at line 453
[KimK 1990] "A New Proposal for a Standard Hangeul (or Korean Script)
Code", KIM Kyongsok. Computer Standards & Interfaces,
Vol. 9, No. 3, pp. 187-202, 1990.
[KimK 1992] "A common Approach to Designing the Hangeul Code and
Keyboard", KIM Kyongsok. Computer Standards & Interfaces,
Vol. 14, No. 4, pp. 297-325, Aug. 1992.
[KimK 1999] A Hangeul story inside computers. KIM, Kyongsok. Busan
National University Press. 1999. [in Hangeul]